## Training Issues
### Why do I see garbled text during training?
This usually means the checkpoint was not loaded correctly. Check the paths passed to `--load` or `--ref-load`. Note that Megatron can only load a directory that contains a `latest_checkpointed_iteration.txt` file. If you need to load a particular iteration, refer to the current Megatron usage instructions; generally, you can specify the step number with `--ckpt-step`.
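As a quick sanity check, you can verify the checkpoint directory layout before launching. The paths below are hypothetical placeholders:

```shell
# Hypothetical checkpoint path; replace with your own.
CKPT_DIR=/path/to/megatron_ckpt

# Megatron expects this file at the top level of the --load / --ref-load directory.
if [ -f "$CKPT_DIR/latest_checkpointed_iteration.txt" ]; then
  echo "checkpoint directory looks loadable"
else
  echo "missing latest_checkpointed_iteration.txt; --load will fail"
fi

# To pin a specific iteration instead of the latest, pass e.g.:
#   --load "$CKPT_DIR" --ckpt-step 2000
```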
### Why is my task stuck on the Ray submission page?
This usually means Ray cannot allocate the GPUs the task requests. Check:

- Whether the `--colocate` parameter is set to enable co-located mode.
- In co-located mode, whether the total number of GPUs for the current task is greater than or equal to `actor_num_nodes * actor_num_gpus_per_node`.
- Otherwise, whether the total number of GPUs for the current task is greater than or equal to `actor_num_nodes * actor_num_gpus_per_node + rollout_num_gpus`.
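The checks above can be sketched as the following arithmetic (the settings are hypothetical; substitute the values from your launch script):

```shell
# Hypothetical settings; substitute your own.
actor_num_nodes=2
actor_num_gpus_per_node=8
rollout_num_gpus=8
total_gpus=24          # GPUs actually available to the Ray cluster
colocate=false         # true if --colocate is set

if [ "$colocate" = true ]; then
  # Co-located mode: rollout shares the actor GPUs.
  required=$(( actor_num_nodes * actor_num_gpus_per_node ))
else
  # Disaggregated mode: rollout engines need their own GPUs.
  required=$(( actor_num_nodes * actor_num_gpus_per_node + rollout_num_gpus ))
fi

echo "required=$required available=$total_gpus"
if [ "$total_gpus" -ge "$required" ]; then
  echo "resources OK"
else
  echo "Ray will keep the job queued until $required GPUs are free"
fi
```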
### Why did I encounter an Out-of-Memory (OOM) error during training? What is max_tokens_per_gpu for?
An OOM error often means `max_tokens_per_gpu` is set too high. This parameter defines the maximum number of tokens that can be processed on each GPU during training. If you are concerned about OOM, you can initially set this value to `rollout_max_response_len / cp_size` and increase it later to improve training efficiency. Note that `--max-tokens-per-gpu` only takes effect when `--use-dynamic-batch-size` is enabled.

If you still experience OOM with a small `max_tokens_per_gpu`, check whether the data generated in a single pass is too long; you may need to enable context parallelism (CP) with `--context-parallel-size`. If you are using custom data generation, check whether the total length of multi-turn generations is much longer than expected.
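A conservative starting value can be computed as sketched below (the numbers are hypothetical):

```shell
# Hypothetical settings; substitute your own.
rollout_max_response_len=8192
cp_size=2   # --context-parallel-size

# Conservative initial value: one maximal response split across CP ranks.
max_tokens_per_gpu=$(( rollout_max_response_len / cp_size ))
echo "start with --max-tokens-per-gpu $max_tokens_per_gpu"
# Remember: this only takes effect together with --use-dynamic-batch-size.
```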
### How do I resume training?
Simply set the `--load` directory to your `--save` directory.
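For example, if a run was launched with a given `--save` path, resuming is just pointing `--load` at the same directory (the path below is a placeholder):

```shell
# Hypothetical path; replace with your own.
SAVE_DIR=/path/to/ckpts

# First run:   ... --save "$SAVE_DIR"
# Resumed run: ... --save "$SAVE_DIR" --load "$SAVE_DIR"
```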
### My gradient norm is very high and the training crashes. What should I do?
### Gradient becomes NaN or Inf during training
You can add the `--no-check-for-nan-in-loss-and-grad` flag to skip the corresponding training steps.

## Multi-node Training
### During multi-node training, what should I do if the transformers library reports it cannot find a model?
This usually happens because multiple processes simultaneously call `AutoConfig.from_pretrained` or `AutoModelForCausalLM.from_pretrained`, causing file system write conflicts. You can mitigate this issue by setting the `--model-name` argument.

## Batch Size and Data
### How is the batch size calculated?
Each rollout samples `rollout_batch_size` prompts. For each prompt, `n_samples_per_prompt` samples are generated. Therefore, one rollout contains a total of `rollout_batch_size * n_samples_per_prompt` data entries.

You can use `--num-steps-per-rollout` to determine how many steps to run per rollout. This is equivalent to setting `global_batch_size` to `rollout_batch_size * n_samples_per_prompt // num_steps_per_rollout`.
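The arithmetic above can be sketched as follows (the values are hypothetical):

```shell
# Hypothetical settings; substitute your own.
rollout_batch_size=32
n_samples_per_prompt=8
num_steps_per_rollout=4

samples_per_rollout=$(( rollout_batch_size * n_samples_per_prompt ))
global_batch_size=$(( samples_per_rollout / num_steps_per_rollout ))

echo "samples per rollout: $samples_per_rollout"   # 32 * 8 = 256
echo "global batch size:   $global_batch_size"     # 256 / 4 = 64
```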
### Does slime perform data packing / variable-length (varlen) processing?
## SGLang Issues
### What should I do if the sglang component shows a `Max retries exceeded with url: /get_model_info` error?
This error usually means the SGLang server failed to start. Check the server logs for the root cause; for example, the model may be too large for the current configuration and may require a larger tensor-parallel size, such as `tp=8`.
### My sglang generation takes an extremely long time, GPU power is maxed out, and there's no output for a long while. Why?
This often happens when generation never stops. Check whether the model loaded from `--hf-checkpoint` has its stop tokens configured correctly. If not, you can set them using the `--rollout-stop` or `--rollout-stop-token-ids` arguments.
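For example, stop conditions could be passed as below. The stop string and token id shown are hypothetical placeholders; use the values that match your tokenizer:

```shell
# Stop on a literal string:
#   --rollout-stop "</s>"
# Or stop on specific token ids (the id shown is a placeholder):
#   --rollout-stop-token-ids 151645
```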
### SGLang shows an `an illegal memory access was encountered` error
This is often an out-of-memory error in disguise. Try lowering `--sglang-mem-fraction-static`.

## Compiler and Cache Issues
### A JSONDecodeError occurs related to torch compile/inductor

This is typically caused by a corrupted torch inductor cache, for example after an interrupted or concurrent compilation. Deleting the inductor cache directory (by default under `/tmp/torchinductor_<user>`, or the path set by `TORCHINDUCTOR_CACHE_DIR`) and rerunning usually resolves it.