h2oGPT can use a remote inference server instead of loading a model locally. Pass --inference_server to generate.py with the appropriate server string, alongside --base_model to identify the model name (needed for prompt formatting).

Server string reference

Server type            --inference_server format
oLLaMa                 vllm_chat:http://localhost:11434/v1/
HF TGI                 http://<ip>:<port>
vLLM (completions)     vllm:<ip>:<port>
vLLM (chat)            vllm_chat:<ip>:<port>
OpenAI Chat            openai_chat
OpenAI Text            openai
Azure OpenAI           openai_azure_chat:<deployment>:<base_url>:<api_version>
Anthropic              anthropic
MistralAI              mistralai
Google                 google
Groq                   groq
Replicate              replicate:<model_string>
AWS SageMaker          sagemaker_chat:<endpoint>:<region>
Gradio                 http://<ip>:<port>
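To make the formats above concrete, here is a minimal sketch of a helper that assembles a few of these server strings. The function name and structure are illustrative only, not part of h2oGPT:

```python
def inference_server_string(server_type: str, **parts) -> str:
    """Assemble an --inference_server value for a few server types
    from the table above. Illustrative helper, not part of h2oGPT."""
    if server_type == "vllm":                      # vLLM completions API
        return f"vllm:{parts['ip']}:{parts['port']}"
    if server_type == "vllm_chat":                 # vLLM/oLLaMa chat API
        return f"vllm_chat:{parts['url']}"
    if server_type == "sagemaker_chat":            # AWS SageMaker endpoint
        return f"sagemaker_chat:{parts['endpoint']}:{parts['region']}"
    if server_type == "replicate":                 # Replicate model string
        return f"replicate:{parts['model_string']}"
    raise ValueError(f"unhandled server type: {server_type}")

print(inference_server_string("vllm", ip="0.0.0.0", port=5000))
# vllm:0.0.0.0:5000
```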

oLLaMa

oLLaMa exposes an OpenAI-compatible API endpoint that h2oGPT connects to via the vllm_chat server type.
1. Start the oLLaMa server

ollama run llama2
2. Connect h2oGPT

python generate.py \
  --base_model=llama2 \
  --inference_server=vllm_chat:http://localhost:11434/v1/ \
  --prompt_type=openai_chat \
  --max_seq_len=4096
To run on specific GPUs, stop the oLLaMa system service first:
sudo systemctl stop ollama.service
CUDA_VISIBLE_DEVICES=0 OLLAMA_HOST=0.0.0.0:11434 ollama serve &> ollama.log &
ollama run mistral:v0.3
Then connect h2oGPT:
python generate.py \
  --base_model=mistral:v0.3 \
  --inference_server=vllm_chat:http://localhost:11434/v1/ \
  --prompt_type=openai_chat \
  --max_seq_len=8192
You can also configure the model entirely from the UI: start python generate.py with no arguments, then in the Models tab enter the model name and server URL, set prompt_type to plain, and set max_seq_len to 4096.
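Because oLLaMa exposes a standard OpenAI-compatible chat endpoint, you can verify it independently of h2oGPT. The sketch below only builds the request so it stays runnable offline; the commented line actually sends it and assumes a live `ollama serve` on the default port:

```python
import json
from urllib import request

def ollama_chat_request(base_url: str, model: str, prompt: str) -> request.Request:
    """Build an OpenAI-style chat request against oLLaMa's /v1 endpoint.
    Sketch only: sending it requires a running oLLaMa server."""
    body = {
        "model": model,  # same name as --base_model, e.g. "llama2"
        "messages": [{"role": "user", "content": prompt}],
    }
    return request.Request(
        base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )

req = ollama_chat_request("http://localhost:11434/v1/", "llama2", "Hello")
print(req.full_url)  # http://localhost:11434/v1/chat/completions
# with request.urlopen(req) as resp:      # requires a live oLLaMa server
#     print(json.load(resp)["choices"][0]["message"]["content"])
```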

HuggingFace Text Generation Inference (TGI)

HF TGI is a high-throughput inference server from HuggingFace. Docker is the recommended install method.

Connecting h2oGPT to TGI

Once TGI is running (e.g. at 192.168.1.46:6112), connect h2oGPT:
SAVE_DIR=./save/ python generate.py \
  --inference_server="http://192.168.1.46:6112" \
  --base_model=h2oai/h2ogpt-oasst1-512-12b

Testing TGI

Python client:

from text_generation import Client

client = Client("http://127.0.0.1:6112")
print(client.generate("What is Deep Learning?", max_new_tokens=17).generated_text)

Or with curl:

curl 127.0.0.1:6112/generate \
  -X POST \
  -d '{"inputs":"<|prompt|>What is Deep Learning?<|endoftext|><|answer|>","parameters":{"max_new_tokens": 512, "truncate": 1024, "do_sample": true, "temperature": 0.1, "repetition_penalty": 1.2}}' \
  -H 'Content-Type: application/json'
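The curl payload above wraps the question in the prompt template used by h2oai/h2ogpt-oasst1-512-12b-style models (`<|prompt|>...<|endoftext|><|answer|>`). A tiny helper (the function name is ours, not part of any library) makes that wrapping explicit:

```python
def h2ogpt_tgi_prompt(question: str) -> str:
    """Wrap a question in the prompt template shown in the curl example,
    as used by h2oai/h2ogpt-oasst1-512-12b-style models."""
    return f"<|prompt|>{question}<|endoftext|><|answer|>"

print(h2ogpt_tgi_prompt("What is Deep Learning?"))
# <|prompt|>What is Deep Learning?<|endoftext|><|answer|>
```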

vLLM

vLLM provides an OpenAI-compatible API server with high throughput via PagedAttention. CUDA 12.1 or newer is recommended for best results.
1. Create a vLLM environment

conda create -n vllm -y
conda activate vllm
conda install python=3.10 -y
sudo apt update && sudo apt install libnccl2 libnccl-dev
2. Install vLLM

export CUDA_HOME=/usr/local/cuda-12.1
export PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cu121"
pip install vllm
3. Start the vLLM server

# LLaMa-2 70B on 4 GPUs
export NCCL_IGNORE_DISABLED_P2P=1
export CUDA_VISIBLE_DEVICES=0,1,2,3
python -m vllm.entrypoints.openai.api_server \
  --port=5000 \
  --host=0.0.0.0 \
  --model h2oai/h2ogpt-4096-llama2-70b-chat \
  --tokenizer=hf-internal-testing/llama-tokenizer \
  --tensor-parallel-size=4 \
  --seed 1234 \
  --max-num-batched-tokens=8192
4. Connect h2oGPT

python generate.py \
  --inference_server="vllm:0.0.0.0:5000" \
  --base_model=h2oai/h2ogpt-oasst1-falcon-40b \
  --langchain_mode=UserData

Mixtral 8×7B with vLLM

export CUDA_VISIBLE_DEVICES=0,1
python -m vllm.entrypoints.openai.api_server \
  --port=5002 \
  --host=0.0.0.0 \
  --model mistralai/Mixtral-8x7B-Instruct-v0.1 \
  --seed 1234 \
  --max-num-batched-tokens=65536 \
  --tensor-parallel-size=2
The vLLM server does not support the vllm_chat (ChatCompletion) server type; use vllm (completions) instead. If you prefix the vLLM IP with https:// or http://, also append /v1 to the full address.
vLLM >= 0.5.0 requires a CUDA driver version >= 12.4. If your driver is older, use vllm/vllm-openai:v0.4.2 instead of :latest.
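The address rule above (a scheme prefix requires a /v1 suffix) can be encoded in a small normalizer. This is an illustrative sketch, not h2oGPT's actual parsing logic:

```python
def normalize_vllm_server(address: str) -> str:
    """Apply the rule above: a vLLM address carrying an http(s):// prefix
    must also end in /v1; bare ip:port addresses are left unchanged."""
    if address.startswith(("http://", "https://")):
        if not address.rstrip("/").endswith("/v1"):
            address = address.rstrip("/") + "/v1"
    return address

print(normalize_vllm_server("http://0.0.0.0:5000"))  # http://0.0.0.0:5000/v1
print(normalize_vllm_server("0.0.0.0:5000"))         # 0.0.0.0:5000 (unchanged)
```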

vLLM AWQ example (Docker)

mkdir -p $HOME/.cache/huggingface/hub
docker run -d \
  --runtime=nvidia \
  --gpus '"device=0,1"' \
  --shm-size=10.24gb \
  -p 5000:5000 \
  -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
  -v "${HOME}"/.cache:$HOME/.cache/ \
  --network host \
  vllm/vllm-openai:latest \
    --port=5000 \
    --host=0.0.0.0 \
    --model=h2oai/h2ogpt-4096-llama2-70b-chat-4bit \
    --tensor-parallel-size=2 \
    --seed 1234 \
    --trust-remote-code \
    --max-num-batched-tokens 8192 \
    --quantization awq \
    --download-dir=/workspace/.cache/huggingface/hub

Cloud provider APIs

Set the OPENAI_API_KEY environment variable, then:
OPENAI_API_KEY=<key> python generate.py \
  --inference_server=openai_chat \
  --base_model=gpt-3.5-turbo \
  --h2ocolors=False \
  --langchain_mode=UserData
OpenAI is not recommended for private document Q&A — document chunks are sent to OpenAI’s servers. Use it for testing or when privacy is not required.

Gradio server-to-server

You can connect h2oGPT as a client to another h2oGPT Gradio server. Start a server:
SAVE_DIR=./save/ python generate.py --base_model=h2oai/h2ogpt-oasst1-512-12b
Then connect a second h2oGPT instance to it:
python generate.py \
  --inference_server="http://192.168.0.10:7680" \
  --base_model=h2oai/h2ogpt-oasst1-falcon-40b
Gradio live share links (https://*.gradio.live) and ngrok tunnels also work as the --inference_server value. If the prompt_type is not automatically detected, pass it explicitly:
python generate.py \
  --inference_server="http://192.168.0.10:7680" \
  --base_model=foo_model \
  --prompt_type=llama2

Replicate

Set REPLICATE_API_TOKEN, install the package, and pass the Replicate model string:
pip install replicate
export REPLICATE_API_TOKEN=<key>

python generate.py \
  --inference_server="replicate:lucataco/llama-2-7b-chat:6ab580ab4eef2c2b440f2441ec0fc0ace5470edaf2cbea50b8550aec0b3fbd38" \
  --base_model="TheBloke/Llama-2-7b-Chat-GPTQ"
Replicate is not recommended for private document Q&A — only chunks of documents are sent to the LLM, but those chunks leave your infrastructure. It is sufficient when full privacy is not required.
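The Replicate portion of the server string has the form <owner>/<model>:<version>, as in the example above. A sketch of a parser for that format (the function is ours, not h2oGPT's):

```python
def parse_replicate_server(value: str) -> dict:
    """Split an --inference_server value of the form
    replicate:<owner>/<model>:<version>. Illustrative parser only."""
    prefix, model_string = value.split(":", 1)
    if prefix != "replicate":
        raise ValueError("not a replicate server string")
    name, _, version = model_string.rpartition(":")
    return {"model": name, "version": version}

info = parse_replicate_server(
    "replicate:lucataco/llama-2-7b-chat:"
    "6ab580ab4eef2c2b440f2441ec0fc0ace5470edaf2cbea50b8550aec0b3fbd38")
print(info["model"])  # lucataco/llama-2-7b-chat
```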

AWS SageMaker

h2oGPT integrates with AWS SageMaker endpoints. Set your AWS credentials and pass the endpoint name and region:
export AWS_ACCESS_KEY_ID=<...>
export AWS_SECRET_ACCESS_KEY=<...>

python generate.py \
  --inference_server=sagemaker_chat:<endpointname>:<region> \
  --base_model=h2oai/h2ogpt-4096-llama2-7b-chat
Streaming is not yet supported for the LangChain SageMaker integration.

Locking multiple models

Use --model_lock to start h2oGPT with multiple inference servers active simultaneously. This enables side-by-side model comparison:
python generate.py --model_lock="[
  {'inference_server':'http://192.168.1.46:6112','base_model':'h2oai/h2ogpt-oasst1-512-12b'},
  {'inference_server':'http://192.168.1.46:6114','base_model':'h2oai/h2ogpt-oasst1-512-20b'},
  {'inference_server':'openai_chat','base_model':'gpt-3.5-turbo'}
]" --model_lock_columns=3
No spaces are allowed inside the double-quoted --model_lock argument due to CLI argument parsing. Use single quotes inside the JSON-like structure.
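Rather than hand-writing the space-free string, you can build it from Python dicts. This is a minimal sketch that follows the quoting convention described above (single quotes inside, no spaces); it is not an h2oGPT utility:

```python
def model_lock_arg(entries: list) -> str:
    """Render a --model_lock value with no spaces and single-quoted
    strings, per the CLI constraints above. Illustrative helper."""
    items = []
    for entry in entries:
        pairs = ",".join(f"'{k}':'{v}'" for k, v in entry.items())
        items.append("{" + pairs + "}")
    return "[" + ",".join(items) + "]"

arg = model_lock_arg([
    {"inference_server": "http://192.168.1.46:6112",
     "base_model": "h2oai/h2ogpt-oasst1-512-12b"},
    {"inference_server": "openai_chat", "base_model": "gpt-3.5-turbo"},
])
print(arg)        # single line, no spaces, ready to pass in double quotes
assert " " not in arg
```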

Selecting visible models at startup

When multiple models are locked, --visible_models controls which of them appear in the UI at startup (here $MODEL_LOCK holds a previously defined model lock string):
export vis="['h2oai/h2ogpt-4096-llama2-70b-chat','HuggingFaceH4/zephyr-7b-alpha','gpt-3.5-turbo-0613']"
python generate.py \
  --model_lock="$MODEL_LOCK" \
  --visible_models="$vis"

In-app model management

When running h2oGPT, you can add and switch models without restarting. In the Models tab:
  1. Enter the model name (same as --base_model) and server URL (same as --inference_server).
  2. Click Add new Model, Lora, Server url:port.
  3. Select the model from the dropdown and click Load-Unload.
Click Load Model Names from Server to auto-populate model names from compatible inference servers (vLLM, oLLaMa, etc.).
--base_model is always required on the CLI. It is not auto-populated even when connecting to a server that exposes a model list endpoint.
