h2oGPT can use a remote inference server instead of loading a model locally. Pass --inference_server to generate.py with the appropriate server string, alongside --base_model to identify the model name (needed for prompt formatting).

Server string reference

Server type            --inference_server format
oLLaMa                 vllm_chat:http://localhost:11434/v1/
HF TGI                 http://<ip>:<port>
vLLM (completions)     vllm:<ip>:<port>
vLLM (chat)            vllm_chat:<ip>:<port>
OpenAI Chat            openai_chat
OpenAI Text            openai
Azure OpenAI           openai_azure_chat:<deployment>:<base_url>:<api_version>
Anthropic              anthropic
MistralAI              mistralai
Google                 google
Groq                   groq
Replicate              replicate:<model_string>
AWS SageMaker          sagemaker_chat:<endpoint>:<region>
Gradio                 http://<ip>:<port>
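To make the formats above concrete, here is a minimal sketch of a helper that assembles a few of these server strings. The function name and structure are illustrative only, not part of h2oGPT:

```python
def inference_server_string(server_type: str, **parts) -> str:
    """Assemble an --inference_server value for a few server types
    from the table above. Illustrative helper, not part of h2oGPT."""
    if server_type == "vllm":                      # vLLM completions API
        return f"vllm:{parts['ip']}:{parts['port']}"
    if server_type == "vllm_chat":                 # vLLM/oLLaMa chat API
        return f"vllm_chat:{parts['url']}"
    if server_type == "sagemaker_chat":            # AWS SageMaker endpoint
        return f"sagemaker_chat:{parts['endpoint']}:{parts['region']}"
    if server_type == "replicate":                 # Replicate model string
        return f"replicate:{parts['model_string']}"
    raise ValueError(f"unhandled server type: {server_type}")

print(inference_server_string("vllm", ip="0.0.0.0", port=5000))
# vllm:0.0.0.0:5000
```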

oLLaMa

oLLaMa exposes an OpenAI-compatible API endpoint that h2oGPT connects to via the vllm_chat server type.
1. Start the oLLaMa server

ollama run llama2
2. Connect h2oGPT

python generate.py \
  --base_model=llama2 \
  --inference_server=vllm_chat:http://localhost:11434/v1/ \
  --prompt_type=openai_chat \
  --max_seq_len=4096
To run on specific GPUs, stop the oLLaMa system service first:
sudo systemctl stop ollama.service
CUDA_VISIBLE_DEVICES=0 OLLAMA_HOST=0.0.0.0:11434 ollama serve &> ollama.log &
ollama run mistral:v0.3
Then connect h2oGPT:
python generate.py \
  --base_model=mistral:v0.3 \
  --inference_server=vllm_chat:http://localhost:11434/v1/ \
  --prompt_type=openai_chat \
  --max_seq_len=8192
You can also configure the model entirely from the UI: start python generate.py with no arguments, then in the Models tab enter the model name and server URL, set prompt_type to plain, and set max_seq_len to 4096.
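Because oLLaMa exposes a standard OpenAI-compatible chat endpoint, you can verify it independently of h2oGPT. The sketch below only builds the request so it stays runnable offline; the commented line actually sends it and assumes a live `ollama serve` on the default port:

```python
import json
from urllib import request

def ollama_chat_request(base_url: str, model: str, prompt: str) -> request.Request:
    """Build an OpenAI-style chat request against oLLaMa's /v1 endpoint.
    Sketch only: sending it requires a running oLLaMa server."""
    body = {
        "model": model,  # same name as --base_model, e.g. "llama2"
        "messages": [{"role": "user", "content": prompt}],
    }
    return request.Request(
        base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )

req = ollama_chat_request("http://localhost:11434/v1/", "llama2", "Hello")
print(req.full_url)  # http://localhost:11434/v1/chat/completions
# with request.urlopen(req) as resp:      # requires a live oLLaMa server
#     print(json.load(resp)["choices"][0]["message"]["content"])
```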

HuggingFace Text Generation Inference (TGI)

HF TGI is a high-throughput inference server from HuggingFace. Docker is the recommended install method.

Connecting h2oGPT to TGI

Once TGI is running (e.g. at 192.168.1.46:6112), connect h2oGPT:
SAVE_DIR=./save/ python generate.py \
  --inference_server="http://192.168.1.46:6112" \
  --base_model=h2oai/h2ogpt-oasst1-512-12b

Testing TGI

Python client:

from text_generation import Client

client = Client("http://127.0.0.1:6112")
print(client.generate("What is Deep Learning?", max_new_tokens=17).generated_text)

Or with curl:

curl 127.0.0.1:6112/generate \
  -X POST \
  -d '{"inputs":"<|prompt|>What is Deep Learning?<|endoftext|><|answer|>","parameters":{"max_new_tokens": 512, "truncate": 1024, "do_sample": true, "temperature": 0.1, "repetition_penalty": 1.2}}' \
  -H 'Content-Type: application/json'
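The curl payload above wraps the question in the prompt template used by h2oai/h2ogpt-oasst1-512-12b-style models (`<|prompt|>...<|endoftext|><|answer|>`). A tiny helper (the function name is ours, not part of any library) makes that wrapping explicit:

```python
def h2ogpt_tgi_prompt(question: str) -> str:
    """Wrap a question in the prompt template shown in the curl example,
    as used by h2oai/h2ogpt-oasst1-512-12b-style models."""
    return f"<|prompt|>{question}<|endoftext|><|answer|>"

print(h2ogpt_tgi_prompt("What is Deep Learning?"))
# <|prompt|>What is Deep Learning?<|endoftext|><|answer|>
```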

vLLM

vLLM provides an OpenAI-compatible API server with high throughput via PagedAttention. CUDA 12.1 or newer is recommended for best results.
1. Create a vLLM environment

conda create -n vllm -y
conda activate vllm
conda install python=3.10 -y
sudo apt update && sudo apt install libnccl2 libnccl-dev
2. Install vLLM

export CUDA_HOME=/usr/local/cuda-12.1
export PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cu121"
pip install vllm
3. Start the vLLM server

# LLaMa-2 70B on 4 GPUs
export NCCL_IGNORE_DISABLED_P2P=1
export CUDA_VISIBLE_DEVICES=0,1,2,3
python -m vllm.entrypoints.openai.api_server \
  --port=5000 \
  --host=0.0.0.0 \
  --model h2oai/h2ogpt-4096-llama2-70b-chat \
  --tokenizer=hf-internal-testing/llama-tokenizer \
  --tensor-parallel-size=4 \
  --seed 1234 \
  --max-num-batched-tokens=8192
4. Connect h2oGPT

python generate.py \
  --inference_server="vllm:0.0.0.0:5000" \
  --base_model=h2oai/h2ogpt-oasst1-falcon-40b \
  --langchain_mode=UserData

Mixtral 8×7B with vLLM

export CUDA_VISIBLE_DEVICES=0,1
python -m vllm.entrypoints.openai.api_server \
  --port=5002 \
  --host=0.0.0.0 \
  --model mistralai/Mixtral-8x7B-Instruct-v0.1 \
  --seed 1234 \
  --max-num-batched-tokens=65536 \
  --tensor-parallel-size=2
The vLLM server does not support the vllm_chat (ChatCompletion) server type; use vllm (completions) instead. If you prefix the vLLM IP with https:// or http://, also append /v1 to the full address.
vLLM >= 0.5.0 requires a CUDA driver version >= 12.4. If your driver is older, use vllm/vllm-openai:v0.4.2 instead of :latest.
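The address rule above (a scheme prefix requires a /v1 suffix) can be encoded in a small normalizer. This is an illustrative sketch, not h2oGPT's actual parsing logic:

```python
def normalize_vllm_server(address: str) -> str:
    """Apply the rule above: a vLLM address carrying an http(s):// prefix
    must also end in /v1; bare ip:port addresses are left unchanged."""
    if address.startswith(("http://", "https://")):
        if not address.rstrip("/").endswith("/v1"):
            address = address.rstrip("/") + "/v1"
    return address

print(normalize_vllm_server("http://0.0.0.0:5000"))  # http://0.0.0.0:5000/v1
print(normalize_vllm_server("0.0.0.0:5000"))         # 0.0.0.0:5000 (unchanged)
```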

vLLM AWQ example (Docker)

mkdir -p $HOME/.cache/huggingface/hub
docker run -d \
  --runtime=nvidia \
  --gpus '"device=0,1"' \
  --shm-size=10.24gb \
  -p 5000:5000 \
  -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
  -v "${HOME}"/.cache:$HOME/.cache/ \
  --network host \
  vllm/vllm-openai:latest \
    --port=5000 \
    --host=0.0.0.0 \
    --model=h2oai/h2ogpt-4096-llama2-70b-chat-4bit \
    --tensor-parallel-size=2 \
    --seed 1234 \
    --trust-remote-code \
    --max-num-batched-tokens 8192 \
    --quantization awq \
    --download-dir=/workspace/.cache/huggingface/hub

Cloud provider APIs

Set the OPENAI_API_KEY environment variable, then:
OPENAI_API_KEY=<key> python generate.py \
  --inference_server=openai_chat \
  --base_model=gpt-3.5-turbo \
  --h2ocolors=False \
  --langchain_mode=UserData
OpenAI is not recommended for private document Q&A — document chunks are sent to OpenAI’s servers. Use it for testing or when privacy is not required.

Gradio server-to-server

You can connect h2oGPT as a client to another h2oGPT Gradio server. Start a server:
SAVE_DIR=./save/ python generate.py --base_model=h2oai/h2ogpt-oasst1-512-12b
Then connect a second h2oGPT instance to it:
python generate.py \
  --inference_server="http://192.168.0.10:7680" \
  --base_model=h2oai/h2ogpt-oasst1-falcon-40b
Gradio live share links (https://*.gradio.live) and ngrok tunnels also work as the --inference_server value. If the prompt_type is not automatically detected, pass it explicitly:
python generate.py \
  --inference_server="http://192.168.0.10:7680" \
  --base_model=foo_model \
  --prompt_type=llama2

Replicate

Set REPLICATE_API_TOKEN, install the package, and pass the Replicate model string:
pip install replicate
export REPLICATE_API_TOKEN=<key>

python generate.py \
  --inference_server="replicate:lucataco/llama-2-7b-chat:6ab580ab4eef2c2b440f2441ec0fc0ace5470edaf2cbea50b8550aec0b3fbd38" \
  --base_model="TheBloke/Llama-2-7b-Chat-GPTQ"
Replicate is not recommended for private document Q&A — only chunks of documents are sent to the LLM, but those chunks leave your infrastructure. It is sufficient when full privacy is not required.
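The Replicate portion of the server string has the form <owner>/<model>:<version>, as in the example above. A sketch of a parser for that format (the function is ours, not h2oGPT's):

```python
def parse_replicate_server(value: str) -> dict:
    """Split an --inference_server value of the form
    replicate:<owner>/<model>:<version>. Illustrative parser only."""
    prefix, model_string = value.split(":", 1)
    if prefix != "replicate":
        raise ValueError("not a replicate server string")
    name, _, version = model_string.rpartition(":")
    return {"model": name, "version": version}

info = parse_replicate_server(
    "replicate:lucataco/llama-2-7b-chat:"
    "6ab580ab4eef2c2b440f2441ec0fc0ace5470edaf2cbea50b8550aec0b3fbd38")
print(info["model"])  # lucataco/llama-2-7b-chat
```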

AWS SageMaker

h2oGPT integrates with AWS SageMaker endpoints. Set your AWS credentials and pass the endpoint name and region:
export AWS_ACCESS_KEY_ID=<...>
export AWS_SECRET_ACCESS_KEY=<...>

python generate.py \
  --inference_server=sagemaker_chat:<endpointname>:<region> \
  --base_model=h2oai/h2ogpt-4096-llama2-7b-chat
Streaming is not yet supported for the LangChain SageMaker integration.

Locking multiple models

Use --model_lock to start h2oGPT with multiple inference servers active simultaneously. This enables side-by-side model comparison:
python generate.py --model_lock="[
  {'inference_server':'http://192.168.1.46:6112','base_model':'h2oai/h2ogpt-oasst1-512-12b'},
  {'inference_server':'http://192.168.1.46:6114','base_model':'h2oai/h2ogpt-oasst1-512-20b'},
  {'inference_server':'openai_chat','base_model':'gpt-3.5-turbo'}
]" --model_lock_columns=3
No spaces are allowed inside the double-quoted --model_lock argument due to CLI argument parsing. Use single quotes inside the JSON-like structure.
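Rather than hand-writing the space-free string, you can build it from Python dicts. This is a minimal sketch that follows the quoting convention described above (single quotes inside, no spaces); it is not an h2oGPT utility:

```python
def model_lock_arg(entries: list) -> str:
    """Render a --model_lock value with no spaces and single-quoted
    strings, per the CLI constraints above. Illustrative helper."""
    items = []
    for entry in entries:
        pairs = ",".join(f"'{k}':'{v}'" for k, v in entry.items())
        items.append("{" + pairs + "}")
    return "[" + ",".join(items) + "]"

arg = model_lock_arg([
    {"inference_server": "http://192.168.1.46:6112",
     "base_model": "h2oai/h2ogpt-oasst1-512-12b"},
    {"inference_server": "openai_chat", "base_model": "gpt-3.5-turbo"},
])
print(arg)        # single line, no spaces, ready to pass in double quotes
assert " " not in arg
```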

Selecting visible models at startup

When multiple models are locked, --visible_models controls which of them appear in the UI at startup (here $MODEL_LOCK holds a previously defined model lock string):
export vis="['h2oai/h2ogpt-4096-llama2-70b-chat','HuggingFaceH4/zephyr-7b-alpha','gpt-3.5-turbo-0613']"
python generate.py \
  --model_lock="$MODEL_LOCK" \
  --visible_models="$vis"

In-app model management

When running h2oGPT, you can add and switch models without restarting. In the Models tab:
  1. Enter the model name (same as --base_model) and server URL (same as --inference_server).
  2. Click Add new Model, Lora, Server url:port.
  3. Select the model from the dropdown and click Load-Unload.
Click Load Model Names from Server to auto-populate model names from compatible inference servers (vLLM, oLLaMa, etc.).
--base_model is always required on the CLI. It is not auto-populated even when connecting to a server that exposes a model list endpoint.
