This guide covers common issues you may encounter when using MaxDiffusion and how to resolve them.
Compilation issues
Compilation takes too long or hangs
- Use the JAX compilation cache to avoid recompiling.
- Reduce the model or batch size during initial testing.
- Check LIBTPU_INIT_ARGS: some flag combinations can slow compilation.
- Enable the profiler to see where compilation is stuck.
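As a minimal sketch of the first step: JAX's persistent compilation cache can be pointed at a directory through the `JAX_COMPILATION_CACHE_DIR` environment variable, which should be set before `jax` is imported. The helper name and cache location below are hypothetical, and if MaxDiffusion exposes its own config option for the cache, that option should be preferred.

```python
import os
import tempfile
from pathlib import Path


def enable_jax_compilation_cache(cache_dir: str) -> str:
    """Create cache_dir if needed and tell JAX to use it as the persistent cache.

    Call this before `import jax` so the value is read at startup.
    """
    path = Path(cache_dir).expanduser()
    path.mkdir(parents=True, exist_ok=True)
    os.environ["JAX_COMPILATION_CACHE_DIR"] = str(path)
    return str(path)


# Hypothetical location for illustration; any writable directory works.
cache_path = enable_jax_compilation_cache(
    os.path.join(tempfile.gettempdir(), "jax_cache")
)
```

With the cache in place, a second run of the same program with unchanged shapes and flags can reuse compiled executables instead of recompiling them.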
XLA compilation errors or mismatched shapes
- Verify parallelism settings match your hardware.
- Check batch size divisibility.
- For Wan models, verify head parallelism divides 40.
- Disable jit_initializers for debugging.
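The first three checks above are simple arithmetic and can be run before launching a job. The sketch below is an illustration, not MaxDiffusion code: the axis names (`data`, `fsdp`, `tensor`) are hypothetical, and it assumes the batch is split across the `data` and `fsdp` axes.

```python
from math import prod
from typing import Dict, List

# Value taken from the guide; this sketch only checks divisibility against it.
WAN_HEAD_DIVISOR = 40


def check_sharding(
    num_devices: int,
    parallelism: Dict[str, int],
    global_batch_size: int,
    head_parallelism: int = 1,
) -> List[str]:
    """Return a list of problems; an empty list means the settings look consistent."""
    problems = []

    # The product of all mesh axes should equal the number of devices.
    mesh_size = prod(parallelism.values())
    if mesh_size != num_devices:
        problems.append(
            f"parallelism product {mesh_size} != device count {num_devices}"
        )

    # Assumption: the batch dimension is sharded over the data and fsdp axes.
    batch_shards = parallelism.get("data", 1) * parallelism.get("fsdp", 1)
    if global_batch_size % batch_shards != 0:
        problems.append(
            f"batch size {global_batch_size} not divisible by {batch_shards}"
        )

    # For Wan models the guide requires head parallelism to divide 40.
    if WAN_HEAD_DIVISOR % head_parallelism != 0:
        problems.append(
            f"head parallelism {head_parallelism} does not divide {WAN_HEAD_DIVISOR}"
        )
    return problems


print(check_sharding(8, {"data": 2, "fsdp": 4, "tensor": 1}, global_batch_size=16))
```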
Incompatible dtype errors
- Match weights and activations dtypes.
- Use float32 for higher precision (slower).
- For GPU, ensure Transformer Engine is installed when using cudnn_flash_te.
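For the last item, one way to fail early is to check that the Transformer Engine package can be imported before selecting `cudnn_flash_te`. This is a sketch under assumptions: it checks for the `transformer_engine` Python module, and the `check_attention_backend` helper and its `attention` argument are hypothetical names, not MaxDiffusion API.

```python
import importlib.util


def transformer_engine_available() -> bool:
    """True if the transformer_engine package can be found in this environment."""
    return importlib.util.find_spec("transformer_engine") is not None


def check_attention_backend(attention: str) -> None:
    """Raise a clear error if cudnn_flash_te is requested without Transformer Engine."""
    if attention == "cudnn_flash_te" and not transformer_engine_available():
        raise RuntimeError(
            "attention=cudnn_flash_te requires Transformer Engine; "
            "install it or choose a different attention kernel."
        )


# Any other value passes through untouched ("dot_product" is illustrative).
check_attention_backend("dot_product")
```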
Out of memory (OOM) errors
TPU/GPU runs out of memory during training
- Reduce batch size.
- Enable gradient checkpointing (rematerialization).
- Use smaller flash block sizes.
- Reduce resolution or number of frames.
- Increase FSDP parallelism to shard the model across more devices.
- For Wan, adjust scoped_vmem_limit.
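To see why increasing FSDP parallelism helps, a rough estimate of per-device weight memory is useful. The sketch below covers model weights only; gradients, optimizer state, and activations add to this figure. The parameter count is an illustrative number, not a measurement of any specific model.

```python
def param_gib_per_device(
    num_params: int, bytes_per_param: int, fsdp_parallelism: int
) -> float:
    """Approximate weight memory per device (GiB) when weights are sharded evenly."""
    if fsdp_parallelism < 1:
        raise ValueError("fsdp_parallelism must be >= 1")
    return num_params * bytes_per_param / fsdp_parallelism / 2**30


# Illustrative only: 14 billion parameters stored in 2 bytes each (e.g. bfloat16).
for fsdp in (1, 4, 8):
    gib = param_gib_per_device(14_000_000_000, 2, fsdp)
    print(f"fsdp={fsdp}: {gib:.2f} GiB of weights per device")
```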
Out of memory during checkpoint loading
- Enable single replica checkpoint restoring.
- For Wan models, use an external disk for the HuggingFace cache.
- Load weights in bfloat16.
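Moving the HuggingFace cache to an external disk can be done with the `HF_HOME` environment variable, which the HuggingFace libraries read to locate their cache. A minimal sketch, assuming the disk is already mounted; the helper name is hypothetical and the demo uses a temporary directory in place of a real mount point.

```python
import os
import tempfile
from pathlib import Path


def use_external_hf_cache(mount_point: str) -> str:
    """Create a cache directory under mount_point and point HF_HOME at it.

    Call this before importing any HuggingFace library.
    """
    cache = Path(mount_point) / "huggingface"
    cache.mkdir(parents=True, exist_ok=True)
    os.environ["HF_HOME"] = str(cache)
    return str(cache)


# Demo with a temporary directory; on a real VM pass the external disk's mount point.
demo_cache = use_external_hf_cache(tempfile.mkdtemp())
```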
Out of memory during data preprocessing
- Process in smaller batches.
- Increase the number of shards.
- Use a streaming dataset instead of an in-memory one.
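Processing in smaller batches and streaming both come down to never holding the whole dataset in memory at once. The generator below is a generic sketch of that idea, not MaxDiffusion's preprocessing code: it pulls items lazily from any iterable and yields fixed-size chunks.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield lists of up to batch_size items without materializing the input."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Only one batch is resident in memory at a time.
for batch in batched(range(10), batch_size=4):
    print(len(batch))
```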
Disk space issues
Insufficient disk space for checkpoints or datasets
- Attach an external disk to the VM.
- Save checkpoints to GCS instead of local disk.
- Disable checkpoint saving during debugging.
- Clean up the HuggingFace cache.
- Use a smaller dataset or streaming.
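Before choosing among the options above, it can help to check how much space is actually free on the target disk. A small sketch using only the standard library; the required size is something you would estimate from your own checkpoint or dataset.

```python
import shutil


def has_free_space(path: str, required_gib: float) -> bool:
    """True if the filesystem containing path has at least required_gib free."""
    free_gib = shutil.disk_usage(path).free / 2**30
    return free_gib >= required_gib


# Example: check the current directory before writing a (hypothetical) 50 GiB checkpoint.
if not has_free_space(".", required_gib=50):
    print("Not enough local disk; consider an external disk or GCS.")
```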
Dataset download fills up disk
- Use a streaming dataset.
- Download to an external disk.
- Download directly to GCS.
Permission and access errors
HuggingFace authentication errors for gated models
- Obtain access to the model on HuggingFace (e.g., Flux, Wan).
- Create a HuggingFace token:
  - Go to https://huggingface.co/settings/tokens
  - Create a token with read permissions
- Set the token in the config or as an environment variable.
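The last step can be sketched as a small lookup that prefers an explicit config value and falls back to the `HF_TOKEN` environment variable, which the HuggingFace Hub client reads. The function name and the `config_token` argument are hypothetical; check the MaxDiffusion config for the actual key it uses.

```python
import os
from typing import Mapping, Optional


def resolve_hf_token(
    config_token: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> str:
    """Return the token from config if set, else from HF_TOKEN; fail with a clear message."""
    token = config_token or env.get("HF_TOKEN")
    if not token:
        raise RuntimeError(
            "No HuggingFace token found. Create one at "
            "https://huggingface.co/settings/tokens and set it in the config "
            "or in the HF_TOKEN environment variable."
        )
    return token
```

Failing here, before any download starts, gives a clearer message than an authentication error from deep inside a model-loading call.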
GCS permission errors
- Authenticate gcloud.
- Set the project.
- Grant permissions to the VM service account.
- Check that the bucket exists and is accessible.
Permission denied when writing to disk
- Check directory permissions.
- Use your home directory or /tmp.
- Run as a user with the appropriate permissions.
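The first two steps can be combined into a helper that tests whether a directory is writable and otherwise falls back to the home directory or the system temp directory. This is a generic standard-library sketch; the function name is hypothetical.

```python
import os
import tempfile


def writable_dir(preferred: str) -> str:
    """Return preferred if it is a writable directory, else home, else the temp dir."""
    candidates = [preferred, os.path.expanduser("~"), tempfile.gettempdir()]
    for candidate in candidates:
        if os.path.isdir(candidate) and os.access(candidate, os.W_OK):
            return candidate
    raise PermissionError(f"no writable directory among {candidates}")


# Falls back automatically if the preferred path is missing or read-only.
output_dir = writable_dir("/nonexistent/output")
print(output_dir)
```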
Training and inference issues
Loss is NaN or training diverges
- Reduce the learning rate.
- Enable gradient clipping.
- Use float32 instead of bfloat16.
- Check data preprocessing - ensure images/videos are normalized correctly.
- Reduce batch size - very large batches can cause instability.
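To illustrate what gradient clipping does, the sketch below clips a set of gradients by their global L2 norm and raises on non-finite values. It is a plain-Python illustration of the technique, not MaxDiffusion's implementation; in a JAX training loop the same effect is usually obtained from the optimizer library, for example `optax.clip_by_global_norm`.

```python
import math
from typing import List


def clip_by_global_norm(grads: List[List[float]], max_norm: float) -> List[List[float]]:
    """Scale all gradients so their combined L2 norm does not exceed max_norm."""
    flat = [g for tensor in grads for g in tensor]
    # Surface NaN/Inf early instead of silently propagating them into the weights.
    if any(math.isnan(g) or math.isinf(g) for g in flat):
        raise FloatingPointError("non-finite gradient encountered")
    norm = math.sqrt(sum(g * g for g in flat))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [[g * scale for g in tensor] for tensor in grads]


# Gradients with global norm 5 are scaled down to norm 1.
print(clip_by_global_norm([[3.0], [4.0]], max_norm=1.0))
```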
Generated images/videos have poor quality
- Increase the number of inference steps.
- Adjust the guidance scale.
- For Wan models, set flow_shift.
- Use higher precision.
- Check if model loaded correctly - verify checkpoint path and weights.
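When tuning the guidance scale, it helps to recall how it enters the computation. The sketch below shows the standard classifier-free guidance combination of the unconditional and conditional predictions; it is assumed here that MaxDiffusion pipelines follow this common formulation, which the guide itself does not state.

```python
from typing import List


def apply_guidance(
    uncond: List[float], cond: List[float], guidance_scale: float
) -> List[float]:
    """Classifier-free guidance: move from the unconditional prediction toward the
    conditional one, amplified by guidance_scale."""
    return [u + guidance_scale * (c - u) for u, c in zip(uncond, cond)]


# A scale of 1.0 reproduces the conditional prediction; larger values push
# further in the direction of the prompt.
for scale in (1.0, 3.5, 7.5):
    print(scale, apply_guidance([0.0, 1.0], [1.0, 3.0], scale))
```

Very high scales tend to over-saturate outputs while very low ones weaken prompt adherence, so it is usually worth sweeping a few values.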
Slow training or inference performance
- Enable flash attention.
- Optimize LIBTPU_INIT_ARGS - see optimization guide.
- Use appropriate flash block sizes for your TPU generation.
- Cache latents and text encodings.
- Enable the profiler to identify bottlenecks.
- For GPU, use fused attention.
Multihost issues
Multihost training hangs or crashes
- Enable distributed system initialization.
- Ensure all hosts run the same code version.
- Check the DCN parallelism settings.
- Verify network connectivity between hosts.
- Use GCS for checkpoints, not local disk.
Multihost data loading is slow
- Ensure there are enough data files: you need more files than hosts.
- Use GCS for data storage, not local disk.
- Enable data shuffling.
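The file-count requirement follows from how data files are typically divided among hosts: each host reads a disjoint subset, so a host with no files has nothing to load. The sketch below shows one common assignment scheme (round-robin over a sorted file list); it is an assumption for illustration, and MaxDiffusion's data loaders may assign files differently.

```python
from typing import List


def files_for_host(files: List[str], host_id: int, num_hosts: int) -> List[str]:
    """Assign every num_hosts-th file (in sorted order) to the given host."""
    if not 0 <= host_id < num_hosts:
        raise ValueError("host_id must be in [0, num_hosts)")
    if len(files) < num_hosts:
        raise ValueError(
            f"{len(files)} files cannot be split across {num_hosts} hosts"
        )
    # Sorting makes the assignment identical on every host.
    return sorted(files)[host_id::num_hosts]


# Hypothetical shard names for illustration.
shards = [f"shard-{i:03d}.tfrecord" for i in range(5)]
for host in range(2):
    print(host, files_for_host(shards, host, num_hosts=2))
```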
Getting help
If you’re still experiencing issues:
- Check the logs for detailed error messages
- Enable profiler to identify performance bottlenecks
- Search GitHub issues: https://github.com/AI-Hypercomputer/maxdiffusion/issues
- File a bug report with:
  - Complete error message and stack trace
  - Hardware type (TPU v5p, v6e, GPU model)
  - MaxDiffusion version and commit hash
  - Full command or config used
  - Steps to reproduce