Overview
Tensor parallelism shards the model’s weight matrices across multiple GPUs. Each GPU holds a horizontal slice of every linear layer’s weight matrix and computes its portion of each matrix-vector product. The partial results are then summed across devices with an all-reduce collective over PyTorch’s NCCL backend.
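As a minimal NumPy sketch (no GPUs or NCCL involved, shapes are illustrative), summing the per-shard partial products reproduces the full matrix-vector product — that sum is exactly what the all-reduce computes:

```python
import numpy as np

rng = np.random.default_rng(0)
out_dim, in_dim, world_size = 6, 8, 2

W = rng.standard_normal((out_dim, in_dim))
x = rng.standard_normal(in_dim)

# Shard W along the input dimension: each "GPU" holds a horizontal slice
# of the layer and the matching slice of the input.
shard = in_dim // world_size
partials = [
    W[:, r * shard:(r + 1) * shard] @ x[r * shard:(r + 1) * shard]
    for r in range(world_size)
]

# "All-reduce": every device ends up with the sum of all partial results,
# which equals the unsharded matrix-vector product.
y = sum(partials)
assert np.allclose(y, W @ x)
```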
This allows Nano-vLLM to:
- Fit models that exceed a single GPU’s VRAM.
- Increase decode throughput on multi-GPU nodes by reducing per-GPU memory bandwidth pressure.
Enabling tensor parallelism
Set tensor_parallel_size in the LLM constructor to the number of GPUs you want to use:
```python
from nanovllm import LLM, SamplingParams

llm = LLM(
    "/path/to/model",
    tensor_parallel_size=2,  # use 2 GPUs
)
```
The default is tensor_parallel_size=1 (single GPU, no distribution).
Requirements and constraints
tensor_parallel_size must evenly divide both num_attention_heads and num_key_value_heads in the model’s HuggingFace config. For example, a model with 8 KV heads cannot use tensor_parallel_size=3.
Valid values are integers in the range 1–8, enforced by Config.__post_init__:

```python
assert 1 <= self.tensor_parallel_size <= 8
```
The number of GPUs reported by torch.cuda.device_count() must be at least tensor_parallel_size. CUDA device indices 0, 1, …, N-1 are used in order.
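The constraints above can be checked up front with a small helper. This is a hypothetical sketch, not part of Nano-vLLM; the attribute names follow the HuggingFace config convention:

```python
def check_tp_size(tp_size, num_attention_heads, num_key_value_heads):
    """Validate a candidate tensor_parallel_size against a model config."""
    assert 1 <= tp_size <= 8, "tensor_parallel_size must be in 1..8"
    assert num_attention_heads % tp_size == 0, "attention heads not divisible"
    assert num_key_value_heads % tp_size == 0, "KV heads not divisible"
    return True

# A model with 16 attention heads and 8 KV heads works with tp_size=2 ...
check_tp_size(2, num_attention_heads=16, num_key_value_heads=8)
# ... but tp_size=3 would raise, since 8 KV heads do not divide by 3.
```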
Example: two-GPU inference
```python
import os

from nanovllm import LLM, SamplingParams

path = os.path.expanduser("~/huggingface/Qwen3-0.6B/")
llm = LLM(
    path,
    tensor_parallel_size=2,
    enforce_eager=False,  # CUDA graphs still work with TP
    max_model_len=4096,
)
sampling_params = SamplingParams(temperature=0.6, max_tokens=256)
prompts = ["Explain tensor parallelism in two sentences."]
outputs = llm.generate(prompts, sampling_params)
print(outputs[0]["text"])
```
How it works internally
LLMEngine.__init__ spawns one worker process per additional GPU using torch.multiprocessing with the spawn start method. Rank 0 runs in the main process; ranks 1…N-1 each run in a child process.
```python
ctx = mp.get_context("spawn")
for i in range(1, config.tensor_parallel_size):
    event = ctx.Event()
    process = ctx.Process(target=ModelRunner, args=(config, i, event))
    process.start()
    self.ps.append(process)
    self.events.append(event)
self.model_runner = ModelRunner(config, 0, self.events)
```
Each ModelRunner calls dist.init_process_group("nccl", "tcp://localhost:2333", ...) and sets its CUDA device to its rank:
```python
dist.init_process_group("nccl", "tcp://localhost:2333", world_size=self.world_size, rank=rank)
torch.cuda.set_device(rank)
```
Inter-process communication via SharedMemory
Rather than pickling tensors through OS pipes, rank 0 serializes call arguments into a shared memory segment (multiprocessing.shared_memory.SharedMemory) and signals worker processes through multiprocessing.Event:
```python
# rank 0 – write
def write_shm(self, method_name, *args):
    data = pickle.dumps([method_name, *args])
    n = len(data)
    self.shm.buf[0:4] = n.to_bytes(4, "little")
    self.shm.buf[4:n+4] = data
    for event in self.event:
        event.set()

# rank > 0 – read
def read_shm(self):
    self.event.wait()
    n = int.from_bytes(self.shm.buf[0:4], "little")
    method_name, *args = pickle.loads(self.shm.buf[4:n+4])
    self.event.clear()
    return method_name, args
```
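The length-prefixed pickle framing can be exercised in a single process without the event handshake. The segment size and the method name here are illustrative, not Nano-vLLM’s actual values:

```python
import pickle
from multiprocessing.shared_memory import SharedMemory

shm = SharedMemory(create=True, size=2**20)
try:
    # Write side (what rank 0 does): 4-byte little-endian length, then payload.
    data = pickle.dumps(["run", {"seq_ids": [0, 1]}, True])
    shm.buf[0:4] = len(data).to_bytes(4, "little")
    shm.buf[4:4 + len(data)] = data

    # Read side (what ranks > 0 do after the event fires).
    n = int.from_bytes(shm.buf[0:4], "little")
    method_name, *args = pickle.loads(bytes(shm.buf[4:4 + n]))
    assert method_name == "run"
    assert args == [{"seq_ids": [0, 1]}, True]
finally:
    shm.close()
    shm.unlink()
```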
After the IPC handshake, actual tensor data flows directly between GPUs over NVLink or PCIe via NCCL all-reduce operations inside the model layers — no CPU round-trip.
KV cache sharding
The KV cache is also sharded: each GPU allocates num_key_value_heads // world_size KV heads per layer:
```python
num_kv_heads = hf_config.num_key_value_heads // self.world_size
```
This keeps per-GPU KV-cache memory proportional to 1/N, so adding GPUs directly increases the total KV cache capacity available to the scheduler.
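A back-of-the-envelope calculation makes the 1/N scaling concrete. The numbers below are illustrative, not taken from any particular model config:

```python
num_key_value_heads = 8
head_dim = 128
num_layers = 28
dtype_bytes = 2        # fp16/bf16
cached_tokens = 4096   # tokens of KV cache being accounted for

def kv_bytes_per_gpu(world_size):
    # Each GPU holds only its shard of the KV heads.
    kv_heads = num_key_value_heads // world_size
    # Factor of 2 for keys and values.
    return 2 * num_layers * kv_heads * head_dim * cached_tokens * dtype_bytes

# Doubling world_size halves each GPU's share of the cache, so the same
# total cache fits in half the per-GPU memory (or twice the cache fits overall).
assert kv_bytes_per_gpu(2) == kv_bytes_per_gpu(1) // 2
```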