Overview
Slime supports multiple training backends for different model architectures and scaling requirements.
Backend Selection
Backends are selected via the `--train-backend` argument.
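As a hedged sketch of backend selection (only the `--train-backend` flag is documented on this page; the launch script name and trailing arguments are assumptions):

```shell
# Hypothetical launch commands; only --train-backend appears in this page.
# Megatron backend for large-scale parallel training:
python train.py --train-backend megatron ...

# FSDP backend for native HuggingFace models:
python train.py --train-backend fsdp ...
```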
Megatron Backend
Overview
Megatron-LM based backend for large-scale distributed training with advanced parallelism strategies.
Features:
- Tensor Parallelism (TP)
- Pipeline Parallelism (PP)
- Data Parallelism (DP)
- Context Parallelism (CP)
- Sequence Parallelism (SP)
- Virtual Pipeline Parallelism
- Expert Parallelism for MoE models
MegatronTrainRayActor
Main training actor for the Megatron backend. Source: `slime/backends/megatron_utils/actor.py:45`
Key Configuration
- Tensor parallelism degree
- Pipeline parallelism degree
- Context parallelism for long sequences
- Enable sequence parallelism
- Number of transformer layers
- Hidden dimension size
- Number of attention heads
- QKV tensor layout: `"thd"` or `"bshd"`
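The options above correspond to standard Megatron-LM argument names; whether slime forwards these spellings verbatim is an assumption, so treat this as a sketch rather than a verified invocation:

```shell
# Hypothetical invocation using standard Megatron-LM argument names;
# slime forwarding these verbatim is an assumption.
python train.py --train-backend megatron \
  --tensor-model-parallel-size 4 \
  --pipeline-model-parallel-size 2 \
  --context-parallel-size 1 \
  --sequence-parallel \
  --num-layers 32 \
  --hidden-size 4096 \
  --num-attention-heads 32
```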
Supported Models
Dense Models:
- LLaMA (1, 2, 3, 3.1)
- Qwen2, Qwen2.5, Qwen3
- GPT (GPT-OSS)
- GLM4
MoE Models:
- Qwen3-MoE
- DeepSeekV3
- GLM4-MoE
Other Models:
- MIMO
- Qwen3-VL (vision)
- Qwen3-Next
slime/backends/megatron_utils/
FSDP Backend
Overview
PyTorch FSDP2-based backend using native HuggingFace models.
Features:
- Fully Sharded Data Parallel (FSDP2)
- Native HuggingFace model support
- Optional CPU offloading
- Gradient checkpointing
- Mixed precision training
FSDPTrainRayActor
Training actor for the FSDP backend. Source: `slime/backends/fsdp_utils/actor.py:34`
Key Configuration
- Enable FSDP CPU offloading for memory efficiency
- Enable activation checkpointing
- Optimizer type (currently supports `"adam"`)
- Attention implementation (e.g., `"flash_attention_2"`)
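The descriptions above come from this page, but the exact argument spellings do not; the flag names below are assumptions chosen to illustrate how the options might be passed:

```shell
# Hypothetical flag names for the FSDP options above; the descriptions are
# from this page, but these exact spellings are assumptions.
python train.py --train-backend fsdp \
  --fsdp-cpu-offload \
  --gradient-checkpointing \
  --optimizer adam \
  --attn-implementation flash_attention_2
```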
Supported Models
Dense Models:
- Any HuggingFace AutoModel-compatible model
MoE Models:
- Qwen3-MoE (with custom kernel support)
slime/backends/fsdp_utils/
SGLang Integration
Overview
SGLang provides high-performance inference engines for rollout generation.
Features:
- RadixAttention prefix caching
- Multi-engine data parallelism
- Speculative decoding
- FP8 quantization
- Continuous batching
Configuration
- SGLang tensor parallelism size (matches `--rollout-num-gpus-per-engine`)
- SGLang data parallelism size (number of engines)
- Enable torch.compile for SGLang
- Speculative decoding algorithm (e.g., `"eagle"`)
- Number of draft tokens for speculation
- Enable deterministic sampling
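Of the options above, only `--rollout-num-gpus-per-engine` is named on this page; the `--sglang-*` spellings below are assumptions, shown only to sketch a plausible invocation:

```shell
# Hypothetical invocation; --rollout-num-gpus-per-engine is documented here,
# the remaining --sglang-* spellings are assumptions. Data parallelism size
# is the number of engines launched.
python train.py \
  --rollout-num-gpus-per-engine 8 \
  --sglang-enable-torch-compile \
  --sglang-speculative-algorithm eagle \
  --sglang-speculative-num-draft-tokens 4 \
  --sglang-enable-deterministic-sampling
```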
External SGLang
Use external SGLang instances instead of auto-launching. Source: `slime/backends/sglang_utils/`, `slime/utils/arguments.py:459-470`
Weight Update Mechanisms
UpdateWeightFromDistributed
Distributed weight transfer for non-colocated setups.
- Gather weights from the training model
- Convert Megatron format to HuggingFace format
- Send to SGLang via HTTP chunked transfer
- Apply optional quantization (FP8)
slime/backends/megatron_utils/update_weight/update_weight_from_distributed.py
UpdateWeightFromTensor
Direct tensor transfer for colocated setups.
- Get weight tensors from the training model
- Convert to HuggingFace format
- Directly load into SGLang (shared memory)
slime/backends/megatron_utils/update_weight/update_weight_from_tensor.py
Backend-Specific Features
Megatron Features
Virtual Pipeline Parallelism: assigns multiple non-contiguous pipeline stages to each GPU, reducing the pipeline bubble.
FSDP Features
CPU Offloading: optionally moves parameters and optimizer state to CPU memory, trading transfer overhead for lower GPU memory pressure.
Backend Comparison
| Feature | Megatron | FSDP |
|---|---|---|
| Tensor Parallelism | ✓ | ✗ |
| Pipeline Parallelism | ✓ | ✗ |
| Context Parallelism | ✓ | ✗ |
| MoE Support | ✓ | Limited |
| Native HF Models | Via conversion | ✓ |
| Memory Efficiency | High (SP/CP) | Medium |
| Setup Complexity | High | Low |
| Best For | Large models (>70B) | Medium models (<70B) |
Choosing a Backend
Use Megatron when:
- Training models >70B parameters
- Need tensor/pipeline parallelism
- Training MoE models with expert parallelism
- Require maximum scalability
Use FSDP when:
- Training models <70B parameters
- Want simple HuggingFace integration
- Prefer lower setup complexity
- Don’t need advanced parallelism
Related Pages
- Training API - Training functions
- Arguments API - Backend configuration