Overview

Tensor parallelism shards the model’s weight matrices across multiple GPUs. Each GPU holds a slice of every linear layer and computes its portion of each matrix product. The partial results are then combined across devices with an all-reduce collective via PyTorch’s NCCL backend. This allows Nano-vLLM to:
  • Fit models that exceed a single GPU’s VRAM.
  • Increase decode throughput on multi-GPU nodes by reducing per-GPU memory bandwidth pressure.
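The shard-then-reduce idea can be demonstrated without any GPUs. In this sketch (plain NumPy, with sizes chosen for illustration), each "rank" holds a horizontal slice of the weight matrix and the matching slice of the activation; summing the partial products stands in for the NCCL all-reduce and recovers the unsharded result exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(16)        # activation vector
W = rng.standard_normal((16, 8))   # full weight matrix
world_size = 2
shard = x.shape[0] // world_size   # rows per "GPU"

# Each rank computes a partial product from its slice of W and x.
partials = []
for rank in range(world_size):
    rows = slice(rank * shard, (rank + 1) * shard)
    partials.append(x[rows] @ W[rows, :])

y = sum(partials)                  # emulated all-reduce (sum across ranks)
assert np.allclose(y, x @ W)       # matches the single-device result
```

With real tensor parallelism the same sum happens on-device via `dist.all_reduce`, so no rank ever materializes the full weight matrix.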

Enabling tensor parallelism

Set tensor_parallel_size in the LLM constructor to the number of GPUs you want to use:
from nanovllm import LLM, SamplingParams

llm = LLM(
    "/path/to/model",
    tensor_parallel_size=2,   # use 2 GPUs
)
The default is tensor_parallel_size=1 (single GPU, no distribution).

Requirements and constraints

tensor_parallel_size must evenly divide both num_attention_heads and num_key_value_heads in the model’s HuggingFace config. For example, a model with 8 KV heads cannot use tensor_parallel_size=3.
Valid values are integers in the range 1–8, enforced by Config.__post_init__:
assert 1 <= self.tensor_parallel_size <= 8
The number of GPUs reported by torch.cuda.device_count() must be at least tensor_parallel_size. CUDA device indices 0, 1, …, N-1 are used in order.
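The three constraints above can be checked up front before constructing the engine. The helper below is purely illustrative (`validate_tp` is not part of Nano-vLLM's API); it mirrors the checks described in this section:

```python
# Hypothetical helper mirroring the documented constraints;
# not part of Nano-vLLM's public API.
def validate_tp(tp_size, num_attention_heads, num_key_value_heads, gpu_count):
    assert 1 <= tp_size <= 8, "tensor_parallel_size must be in 1..8"
    assert num_attention_heads % tp_size == 0, \
        "tensor_parallel_size must divide num_attention_heads"
    assert num_key_value_heads % tp_size == 0, \
        "tensor_parallel_size must divide num_key_value_heads"
    assert gpu_count >= tp_size, "not enough visible CUDA devices"

# 16 attention heads / 8 KV heads: tp=2 is valid on a 2-GPU node...
validate_tp(2, num_attention_heads=16, num_key_value_heads=8, gpu_count=2)
# ...but tp=3 would raise: 8 KV heads are not divisible by 3.
```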

Example: two-GPU inference

import os
from nanovllm import LLM, SamplingParams

path = os.path.expanduser("~/huggingface/Qwen3-0.6B/")

llm = LLM(
    path,
    tensor_parallel_size=2,
    enforce_eager=False,   # CUDA graphs still work with TP
    max_model_len=4096,
)

sampling_params = SamplingParams(temperature=0.6, max_tokens=256)
prompts = ["Explain tensor parallelism in two sentences."]
outputs = llm.generate(prompts, sampling_params)
print(outputs[0]["text"])

How it works internally

LLMEngine.__init__ spawns one worker process per additional GPU using torch.multiprocessing with the spawn start method. Rank 0 runs in the main process; ranks 1…N-1 each run in a child process.
ctx = mp.get_context("spawn")
for i in range(1, config.tensor_parallel_size):
    event = ctx.Event()
    process = ctx.Process(target=ModelRunner, args=(config, i, event))
    process.start()
    self.ps.append(process)
    self.events.append(event)
self.model_runner = ModelRunner(config, 0, self.events)
Each ModelRunner calls dist.init_process_group("nccl", "tcp://localhost:2333", ...) and sets its CUDA device to its rank:
dist.init_process_group("nccl", "tcp://localhost:2333", world_size=self.world_size, rank=rank)
torch.cuda.set_device(rank)

Inter-process communication via SharedMemory

Rather than pickling tensors through OS pipes, rank 0 serializes call arguments into a POSIX shared memory segment (SharedMemory) and signals worker processes through multiprocessing.Event:
# rank 0 – write
def write_shm(self, method_name, *args):
    data = pickle.dumps([method_name, *args])
    n = len(data)
    self.shm.buf[0:4] = n.to_bytes(4, "little")
    self.shm.buf[4:n+4] = data
    for event in self.event:
        event.set()

# rank > 0 – read
def read_shm(self):
    self.event.wait()
    n = int.from_bytes(self.shm.buf[0:4], "little")
    method_name, *args = pickle.loads(self.shm.buf[4:n+4])
    self.event.clear()
    return method_name, args
After the IPC handshake, actual tensor data flows directly between GPUs over NVLink or PCIe via NCCL all-reduce operations inside the model layers — no CPU round-trip.

KV cache sharding

The KV cache is also sharded: each GPU allocates num_key_value_heads // world_size KV heads per layer:
num_kv_heads = hf_config.num_key_value_heads // self.world_size
This keeps each GPU’s share of the KV cache proportional to 1/N, so adding GPUs directly increases the total KV cache capacity (in tokens) available to the scheduler.
