HuggingFace Transformers models and llama.cpp GGUF models are supported via CUDA on Linux and via MPS on macOS.
Basic 8-bit loading
The simplest way to run a model on GPU is with bitsandbytes 8-bit quantization. This roughly halves VRAM usage compared to full 16-bit precision:
```bash
python generate.py --base_model=h2oai/h2ogpt-oig-oasst1-512-6_9b --load_8bit=True
```
For stronger chat quality, a more capable instruction-tuned model such as Zephyr is recommended:
```bash
python generate.py --base_model=HuggingFaceH4/zephyr-7b-beta --load_8bit=True
```
Then open your browser at http://0.0.0.0:7860 (Linux) or http://localhost:7860 (Windows/macOS).
For older GPUs, bitsandbytes 8-bit and 4-bit support may require downgrading:
```bash
pip uninstall bitsandbytes -y && pip install bitsandbytes==0.38.1
```
4-bit loading
Pass --load_4bit=True for 4-bit NF4 quantization via bitsandbytes. This is supported for architectures including GPT-NeoX-20B, GPT-J, and LLaMa:
```bash
python generate.py --base_model=meta-llama/Llama-2-7b-chat-hf --load_4bit=True --prompt_type=llama2
```
8-bit vs 4-bit tradeoffs
| Setting | VRAM reduction | Quality | Speed |
|---|---|---|---|
| --load_8bit=True | ~50% | Near full precision | Moderate |
| --load_4bit=True | ~75% | Slight degradation | Moderate |
| AutoGPTQ 4-bit | ~75% | Good (GPTQ-optimized) | Fast (with CUDA kernels) |
| AutoAWQ 4-bit | ~75% | Good (AWQ-optimized) | Fast |
Memory requirements guidance
Approximate VRAM requirements at different precisions:
| Model size | FP16 | 8-bit | 4-bit / GPTQ |
|---|---|---|---|
| 7B | ~14 GB | ~7 GB | ~4–5 GB |
| 13B | ~26 GB | ~13 GB | ~7–9 GB |
| 70B | ~140 GB | ~70 GB | ~35–40 GB |
See the FAQ on larger models for detailed guidance.
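The weights-only arithmetic behind the table above can be sketched as follows. This is an illustrative helper, not part of h2oGPT; the optional overhead factor for KV cache and activations is an assumption, and real usage varies with context length and batch size:

```python
# Rough VRAM estimate for dense transformer weights: params * bits / 8 bytes,
# optionally inflated by an overhead fraction for KV cache and activations.
def estimate_vram_gb(n_params_billion: float, bits: int, overhead: float = 0.0) -> float:
    bytes_for_weights = n_params_billion * 1e9 * bits / 8
    return bytes_for_weights * (1 + overhead) / 1e9

# Matches the table: a 7B model at FP16 needs ~14 GB for weights alone.
print(round(estimate_vram_gb(7, 16)))   # 14
print(round(estimate_vram_gb(13, 8)))   # 13
print(round(estimate_vram_gb(70, 4)))   # 35
```

Quantized formats like GPTQ land slightly above the raw 4-bit figure because group-wise scales and zero-points add a few percent of extra storage.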
AutoGPTQ
AutoGPTQ provides pre-quantized 4-bit models with CUDA kernel acceleration. Use --load_gptq to specify the weight file name:
If you see CUDA extension not installed during loading, recompile the AutoGPTQ CUDA kernels. Without recompilation, generation will be significantly slower even when using a GPU.
Run a 13B GPTQ model
```bash
python generate.py \
  --base_model=TheBloke/Nous-Hermes-13B-GPTQ \
  --score_model=None \
  --load_gptq=model \
  --use_safetensors=True \
  --prompt_type=instruct \
  --langchain_mode=UserData
```
This uses approximately 9800 MB of VRAM. Add --hf_embedding_model=sentence-transformers/all-MiniLM-L6-v2 to reduce usage to ~9340 MB.

Run LLaMa-2 7B GPTQ
```bash
python generate.py \
  --base_model=TheBloke/Llama-2-7b-Chat-GPTQ \
  --load_gptq=model \
  --use_safetensors=True \
  --prompt_type=llama2 \
  --save_dir=save
```
Run LLaMa-2 70B GPTQ (4-bit)
```bash
CUDA_VISIBLE_DEVICES=0 python generate.py \
  --base_model=Llama-2-70B-chat-GPTQ \
  --load_gptq=gptq_model-4bit--1g \
  --use_safetensors=True \
  --prompt_type=llama2 \
  --save_dir=save
```
Achieves approximately 12 tokens/sec.
GPTQ with RoPE scaling (extended context)
To extend context beyond training length using RoPE scaling with GPTQ:
```bash
pip install transformers==4.31.0
```
```bash
python generate.py \
  --base_model=TheBloke/Llama-2-7b-Chat-GPTQ \
  --load_gptq=model \
  --use_safetensors=True \
  --prompt_type=llama2 \
  --score_model=None \
  --rope_scaling="{'type':'dynamic', 'factor':4}" \
  --max_max_new_tokens=15000 \
  --max_new_tokens=15000 \
  --max_time=12000
```
This configuration uses only ~5.5 GB of VRAM for a 7B model.
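The --rope_scaling value above is written as a Python-style dict literal. A minimal sketch of how such a flag string can be parsed safely with the standard library (this is illustrative, not h2oGPT's actual argument parser):

```python
import ast

def parse_dict_flag(value: str) -> dict:
    """Parse a CLI value like "{'type':'dynamic', 'factor':4}" into a dict.

    ast.literal_eval only accepts Python literals, so arbitrary code in the
    flag value cannot execute (unlike eval).
    """
    parsed = ast.literal_eval(value)
    if not isinstance(parsed, dict):
        raise ValueError(f"expected a dict literal, got {type(parsed).__name__}")
    return parsed

cfg = parse_dict_flag("{'type':'dynamic', 'factor':4}")
print(cfg)  # {'type': 'dynamic', 'factor': 4}
```

Note the quoting in the shell command: the whole value is wrapped in double quotes so the single-quoted keys survive shell parsing.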
HuggingFace Transformers does not directly support GPTQ with RoPE scaling. Only exllama supports AutoGPTQ with RoPE scaling. vLLM supports LLaMa-2 and AutoGPTQ but not RoPE scaling.
AutoAWQ
AutoAWQ (activation-aware weight quantization) is another efficient 4-bit approach. Use --load_awq:
13B on 24 GB GPU
```bash
CUDA_VISIBLE_DEVICES=0 python generate.py \
  --base_model=TheBloke/Llama-2-13B-chat-AWQ \
  --score_model=None \
  --load_awq=model \
  --use_safetensors=True \
  --prompt_type=llama2
```
Uses approximately 14 GB of VRAM.

70B on 1×48 GB GPU
```bash
CUDA_VISIBLE_DEVICES=0 python generate.py \
  --base_model=TheBloke/Llama-2-70B-chat-AWQ \
  --score_model=None \
  --load_awq=model \
  --use_safetensors=True \
  --prompt_type=llama2
```
Uses approximately 39 GB of VRAM.

70B on 2×24 GB GPUs
```bash
CUDA_VISIBLE_DEVICES=2,3 python generate.py \
  --base_model=TheBloke/Llama-2-70B-chat-AWQ \
  --score_model=None \
  --load_awq=model \
  --use_safetensors=True \
  --prompt_type=llama2
```
Multi-GPU usage
Automatic multi-GPU with HuggingFace
Set --use_gpu_id=False to let HuggingFace distribute the model across all available GPUs automatically:
```bash
pip install transformers==4.31.0
```
```bash
python generate.py \
  --base_model=meta-llama/Llama-2-70b-chat-hf \
  --prompt_type=llama2 \
  --rope_scaling="{'type': 'linear', 'factor': 4}" \
  --use_gpu_id=False \
  --save_dir=savemeta70b
```
Running on 4×A6000 gives about 4 tokens/sec, consuming ~35 GB per GPU.
Multi-GPU with exllama
Use --exllama_dict to control GPU memory allocation explicitly across devices:
```bash
python generate.py \
  --base_model=TheBloke/Llama-2-70B-chat-GPTQ \
  --load_exllama=True \
  --use_safetensors=True \
  --use_gpu_id=False \
  --load_gptq=main \
  --prompt_type=llama2 \
  --exllama_dict="{'set_auto_map':'20,20'}"
```
This splits the 70B model across 2 GPUs, each using 20 GB.
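The set_auto_map value is a comma-separated list of per-GPU memory caps in GB, one entry per visible device. A minimal sketch of how such a string decomposes (an illustrative helper, not exllama's own code):

```python
def parse_auto_map(auto_map: str) -> list:
    """Split a set_auto_map string like '20,20' into per-GPU GB caps.

    Index i in the result is the memory cap for GPU i; '20,20' means
    GPU 0 and GPU 1 each get up to 20 GB of model weights.
    """
    return [float(x) for x in auto_map.split(',')]

caps = parse_auto_map('20,20')
print(caps)  # [20.0, 20.0]
```

Uneven splits (e.g. '16,24' for mismatched cards) follow the same format.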
exllama
exllama is the only backend that supports AutoGPTQ with RoPE scaling. It is recommended for extended context inference on GPTQ models.
```bash
# LLaMa-2 7B with 16k context via RoPE scaling
python generate.py \
  --base_model=TheBloke/Llama-2-7b-Chat-GPTQ \
  --load_gptq=model \
  --use_safetensors=True \
  --prompt_type=llama2 \
  --save_dir=save \
  --load_exllama=True \
  --revision=gptq-4bit-32g-actorder_True \
  --rope_scaling="{'alpha_value':4}"
```
```bash
# LLaMa-2 70B on single A6000 (~48 GB, ~12 tokens/sec)
python generate.py \
  --base_model=TheBloke/Llama-2-70B-chat-GPTQ \
  --load_gptq=gptq_model-4bit-128g \
  --use_safetensors=True \
  --prompt_type=llama2 \
  --load_exllama=True \
  --revision=main
```
When using exllama, set --concurrency_count=1. The exllama backend shares model state and will produce mixed-up responses for concurrent requests if concurrency is greater than 1.
llama.cpp on GPU
By default, when loading a GGUF model, h2oGPT sets n_gpu_layers to a large value so that llama.cpp offloads all transformer layers to GPU for maximum performance:
```bash
python generate.py \
  --base_model=HuggingFaceH4/zephyr-7b-beta \
  --prompt_type=zephyr \
  --score_model=None
```
Check startup logs to confirm all layers are offloaded:
```text
llama_model_load_internal: offloaded 35/35 layers to GPU
```
To partially offload (useful for low-VRAM GPUs), pass a specific layer count:
```bash
python generate.py --base_model=llama --llamacpp_dict="{'n_gpu_layers':20}"
```
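Choosing a partial layer count comes down to dividing free VRAM by the approximate per-layer weight size. A rough illustrative helper; the uniform-layer-size assumption and the example numbers are hypothetical, so treat the result as a starting point and adjust from the startup logs:

```python
# Illustrative helper (not part of h2oGPT): pick n_gpu_layers for a
# low-VRAM GPU, assuming roughly uniform per-layer weight size.
def choose_n_gpu_layers(model_size_gb: float, n_layers: int, free_vram_gb: float) -> int:
    per_layer_gb = model_size_gb / n_layers
    fit = int(free_vram_gb / per_layer_gb)
    return max(0, min(n_layers, fit))  # clamp to [0, n_layers]

# e.g. a ~4 GB quantized 7B GGUF with 35 layers on a GPU with ~2.3 GB free:
print(choose_n_gpu_layers(4.0, 35, 2.3))   # 20
# with ample VRAM, everything is offloaded:
print(choose_n_gpu_layers(4.0, 35, 100.0)) # 35
```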
Once llama-cpp-python is compiled with CUDA support, it no longer works in CPU-only mode. You will need a separate h2oGPT environment for CPU inference or reinstall llama-cpp-python without CUDA flags.