When a single GPU is not enough

The largest open-weight models available today exceed the memory capacity of any single GPU. A 70-billion-parameter model in FP16 requires approximately 140 GB of GPU memory for the weights alone (2 bytes per parameter), before accounting for KV cache, activation memory, and framework overhead. Quantization reduces this to roughly 70 GB at 8 bits or 35 GB at 4 bits per weight, but once the KV cache and runtime overhead are added, many models still do not fit on a single device. Multi-GPU inference is not optional for these models; it is a prerequisite.
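The weight-memory figures above follow from simple arithmetic. The sketch below is a rough estimator only: the function name is illustrative, 1 GB is taken as 10^9 bytes, and KV cache, activations, and framework overhead are deliberately left out.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    bytes_per_weight = bits_per_weight / 8
    # (params_billions * 1e9 weights) * bytes_per_weight / 1e9 bytes per GB
    return params_billions * bytes_per_weight


for label, bits in [("FP16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"70B at {label}: {weight_memory_gb(70, bits):.0f} GB")
```

Comparing the result with the memory of the target GPU, minus a reserve for the KV cache, gives a first indication of whether multi-GPU serving is required.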

But even when a model fits on a single GPU, there are throughput and latency reasons to split it across multiple devices. A single GPU serving a 13B parameter model might handle 20 concurrent requests before throughput saturates. Distributing that model across two GPUs can increase throughput by enabling more parallel processing of requests, or reduce per-request latency by processing different parts of the model simultaneously.

On-premises deployments face specific constraints that make the parallelism strategy choice more consequential than in cloud environments. GPU interconnect bandwidth varies dramatically between hardware configurations: NVLink provides 600-900 GB/s of aggregate bandwidth per GPU within a node, while PCIe 4.0 x16 provides only about 32 GB/s per direction, and cross-node InfiniBand provides 25-50 GB/s per port. The parallelism strategy must match your hardware topology, or the communication overhead will eliminate any benefit from using multiple GPUs.

Tensor parallelism: splitting layers across GPUs

Tensor parallelism (TP) splits individual layers of the model across multiple GPUs. Each GPU holds a slice of every layer, and they work together to compute the output of each layer in parallel. For a transformer model, this means the attention heads and feed-forward network matrices are partitioned column-wise or row-wise across GPUs, with each GPU computing its portion and then communicating partial results to produce the final output.

The key communication operations in tensor-parallel inference are all-reduce and all-gather. In the common Megatron-style layout, the first matrix multiplication of each attention or feed-forward block is split column-wise and needs no communication; the second is split row-wise, so each GPU produces a partial sum and an all-reduce adds the partial results across GPUs. This happens in every transformer layer (typically once after attention and once after the feed-forward network), meaning TP generates high-frequency, moderate-sized communication between GPUs throughout inference.
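To make the partitioning concrete, here is a toy, pure-Python sketch of a tensor-parallel feed-forward block on two simulated GPUs. It assumes the Megatron-style layout (first weight matrix split by columns, second split by rows) and replaces the real all-reduce with a plain element-wise sum; all helper names and the tiny matrices are made up for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    return [[max(0, v) for v in row] for row in m]


def split_columns(m, parts):
    """Split a matrix into `parts` equal column blocks."""
    width = len(m[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(parts)]


def split_rows(m, parts):
    """Split a matrix into `parts` equal row blocks."""
    height = len(m) // parts
    return [m[i * height:(i + 1) * height] for i in range(parts)]


def all_reduce_sum(partials):
    """Stand-in for an all-reduce: element-wise sum of every GPU's partial result."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


# Toy feed-forward block: x (1x2) -> W1 (2x4) -> ReLU -> W2 (4x2)
x = [[1, -2]]
w1 = [[1, 2, -1, 0], [0, 1, 3, -2]]
w2 = [[1, 0], [2, 1], [0, -1], [1, 1]]

# Unsharded reference result.
reference = matmul(relu(matmul(x, w1)), w2)

# Two simulated GPUs: W1 is column-parallel, W2 is row-parallel.
w1_shards = split_columns(w1, 2)
w2_shards = split_rows(w2, 2)
partials = [matmul(relu(matmul(x, a)), b) for a, b in zip(w1_shards, w2_shards)]
tp_output = all_reduce_sum(partials)

print(reference, tp_output)  # both are [[5, 4]]
```

Because the activation function is applied element-wise, each GPU can run its column shard, the activation, and its row shard without talking to the others; only the final sum requires communication.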

The advantage of TP is reduced latency. Because all GPUs process every token simultaneously (each handling its slice of the computation), the time to process a single token is reduced compared to running the full model on one GPU. For latency-sensitive applications like interactive chat, this makes TP the preferred strategy. With 2-way TP on NVLink-connected GPUs, per-token latency reductions of 30-40% are typical for large models.

The disadvantage is the communication bandwidth requirement. The all-reduce operations at every layer require high-bandwidth, low-latency interconnects. On NVLink, the communication overhead is small relative to computation time. On PCIe, the overhead becomes significant: when a 70B model processes a prompt or batch of a few thousand tokens, each all-reduce moves tens of megabytes, and with 80 transformer layers the cumulative communication cost can consume as much as half of the total inference time. During token-by-token decoding the messages are much smaller, but every layer still waits on a synchronization round trip. Tensor parallelism across PCIe-connected GPUs is generally not recommended for latency-sensitive workloads.
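A back-of-the-envelope model shows why the interconnect matters so much here. The sketch below is a bandwidth-only estimate under several assumptions that are not from the text: a ring all-reduce (each GPU sends about 2(N-1)/N of the payload), two all-reduces per layer, FP16 activations, a hidden size of 8192, and the nominal link speeds quoted earlier used as if they were fully achievable. Per-message latency is ignored, so real numbers will be worse.

```python
def allreduce_seconds(payload_bytes: float, tp_degree: int, bandwidth_gb_s: float) -> float:
    """Bandwidth-only time for one ring all-reduce (latency term ignored)."""
    # A ring all-reduce sends about 2 * (N - 1) / N of the payload per GPU.
    traffic = 2 * (tp_degree - 1) / tp_degree * payload_bytes
    return traffic / (bandwidth_gb_s * 1e9)


def tp_comm_seconds(tokens: int, hidden_size: int, num_layers: int, tp_degree: int,
                    bandwidth_gb_s: float, bytes_per_value: int = 2,
                    allreduces_per_layer: int = 2) -> float:
    """Total all-reduce time for one forward pass over `tokens` tokens."""
    payload = tokens * hidden_size * bytes_per_value  # activation tensor being reduced
    per_op = allreduce_seconds(payload, tp_degree, bandwidth_gb_s)
    return per_op * allreduces_per_layer * num_layers


# Assumed workload: 2048-token prefill, hidden size 8192, 80 layers, 2-way TP.
for name, gb_s in [("PCIe 4.0 x16", 32), ("NVLink", 600)]:
    ms = tp_comm_seconds(2048, 8192, 80, 2, gb_s) * 1e3
    print(f"{name}: {ms:.1f} ms of all-reduce traffic per forward pass")
```

Under these assumptions each all-reduce carries about 33.5 MB, and the total comes to roughly 168 ms over PCIe versus about 9 ms over NVLink. Substituting measured link bandwidth for the nominal figures gives a more honest estimate for a specific server.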

Pipeline parallelism: splitting the model into stages

Pipeline parallelism (PP) assigns different layers of the model to different GPUs. GPU 0 might hold layers 0-19, GPU 1 holds layers 20-39, GPU 2 holds layers 40-59, and GPU 3 holds layers 60-79. A request enters at GPU 0, flows through each stage sequentially, and exits from GPU 3 with the completed output. Each GPU only communicates with its immediate neighbors, passing the hidden state tensor between stages.
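The layer-to-GPU assignment described above can be sketched in a few lines. This is an illustrative helper, not any framework's API: it splits the layers into contiguous, near-equal ranges and hands any remainder to the earliest stages.

```python
def assign_layers(num_layers: int, num_stages: int) -> list[range]:
    """Split layers 0..num_layers-1 into contiguous, near-equal pipeline stages."""
    base, extra = divmod(num_layers, num_stages)
    stages, start = [], 0
    for stage in range(num_stages):
        # The first `extra` stages take one additional layer each.
        size = base + (1 if stage < extra else 0)
        stages.append(range(start, start + size))
        start += size
    return stages


for gpu, layers in enumerate(assign_layers(80, 4)):
    print(f"GPU {gpu}: layers {layers.start}-{layers.stop - 1}")
```

For 80 layers and 4 stages this reproduces the 0-19, 20-39, 40-59, 60-79 split from the example.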

The communication pattern in PP is fundamentally different from TP. Instead of all-reduce operations at every layer, PP performs point-to-point transfers of the hidden state between stages. These transfers are far less frequent (once per stage boundary rather than at every layer), and each one is a one-way send of the hidden state, typically 4-16 KB per token in FP16, rather than a collective in which every GPU must exchange data and synchronize. The total volume moved per token is therefore a small fraction of what TP exchanges. This makes PP much more tolerant of limited interconnect bandwidth.
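The difference in communication frequency is easy to quantify. The sketch below uses hypothetical helper names and assumes FP16 activations and two all-reduces per transformer layer for TP; it counts communication events per generated token rather than modelling their cost.

```python
def hidden_state_bytes(hidden_size: int, bytes_per_value: int = 2) -> int:
    """Size of one token's hidden state (FP16 by default)."""
    return hidden_size * bytes_per_value


def pp_transfers_per_token(num_stages: int) -> int:
    """Point-to-point sends per token: one per stage boundary."""
    return num_stages - 1


def tp_allreduces_per_token(num_layers: int, per_layer: int = 2) -> int:
    """Collective operations per token under tensor parallelism."""
    return num_layers * per_layer


for hidden in (2048, 4096, 8192):
    print(f"hidden size {hidden}: {hidden_state_bytes(hidden) // 1024} KB per token")

print("4-stage PP:", pp_transfers_per_token(4), "transfers per token")
print("TP over 80 layers:", tp_allreduces_per_token(80), "all-reduces per token")
```

For an 80-layer model this is 3 sends per token with 4-stage PP against 160 collectives with TP, which is why PP copes with slow links.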

PP is the correct choice when GPUs are connected via PCIe or when spanning multiple nodes over network interconnects. The communication volume is low enough that even PCIe 3.0 bandwidth is sufficient for most models. Cross-node PP over InfiniBand introduces latency at each stage boundary, but for batch processing workloads where per-request latency is less critical, this is acceptable.

The primary disadvantage of PP for inference is pipeline bubbles. When processing a single request, only one stage is active at a time while the other GPUs wait. GPU 0 processes the request through its layers, then sits idle while GPUs 1-3 do their work. This means that for a single request, PP provides zero latency benefit and simply distributes the memory across GPUs. The throughput benefit of PP comes from filling the pipeline with multiple requests: while GPU 3 processes request 1, GPU 2 processes request 2, GPU 1 processes request 3, and so on. With enough concurrent requests, all stages stay busy and total throughput scales with the number of stages.
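The bubble effect can be seen in a small simulation. The sketch below is a simplified model with assumed conditions: every stage takes exactly one time step, each request makes a single forward pass, and transfer time is zero. Function names are illustrative.

```python
from typing import List, Optional


def pipeline_schedule(num_stages: int, num_requests: int) -> List[List[Optional[int]]]:
    """Per time step, the request each stage is processing (None means idle)."""
    steps = num_stages + num_requests - 1
    schedule = []
    for t in range(steps):
        row = []
        for stage in range(num_stages):
            request = t - stage  # request r reaches stage s at step r + s
            row.append(request if 0 <= request < num_requests else None)
        schedule.append(row)
    return schedule


def utilization(schedule: List[List[Optional[int]]]) -> float:
    """Fraction of (time step, stage) slots that are doing useful work."""
    slots = [slot for row in schedule for slot in row]
    return sum(slot is not None for slot in slots) / len(slots)


for requests in (1, 4, 16):
    u = utilization(pipeline_schedule(4, requests))
    print(f"{requests:2d} request(s) on 4 stages: {u:.0%} of GPU time used")
```

With one request only a quarter of the GPU time is used; utilization rises as more requests are in flight, matching the argument above.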

Hybrid parallelism and practical configurations

Production deployments often combine both strategies in a hybrid parallelism configuration. The typical pattern is to use tensor parallelism within a node (where NVLink provides high bandwidth) and pipeline parallelism across nodes (where network bandwidth is the bottleneck). A 4-node cluster with 4 GPUs per node might use 4-way TP within each node and 4-way PP across nodes, distributing a model across all 16 GPUs.
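The 16-GPU example can be written down as a rank mapping. This is a minimal sketch assuming GPUs are numbered node by node, so that consecutive ranks share a node; the function name and the (pp_stage, tp_rank) tuple layout are illustrative, not taken from any framework.

```python
def hybrid_layout(num_nodes: int, gpus_per_node: int) -> dict[int, tuple[int, int]]:
    """Map global GPU rank -> (pp_stage, tp_rank): TP inside a node, PP across nodes."""
    layout = {}
    for rank in range(num_nodes * gpus_per_node):
        # Node index becomes the pipeline stage; position in the node is the TP rank.
        pp_stage, tp_rank = divmod(rank, gpus_per_node)
        layout[rank] = (pp_stage, tp_rank)
    return layout


layout = hybrid_layout(num_nodes=4, gpus_per_node=4)
for stage in range(4):
    members = [rank for rank, (s, _) in layout.items() if s == stage]
    print(f"pipeline stage {stage}: TP group on GPUs {members}")
```

Keeping each TP group inside one node means the all-reduce traffic stays on NVLink, while only hidden-state transfers cross the network.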

The optimal configuration depends on your specific hardware and workload. Here are practical guidelines for common on-premises setups:

Single node, 2 GPUs with NVLink: Use 2-way TP. The NVLink bandwidth handles the all-reduce communication efficiently, and TP provides the best single-request latency.

Single node, 4-8 GPUs with NVLink: Use TP up to the NVLink domain size. On DGX-style systems where all GPUs have full NVLink connectivity, use 4-way or 8-way TP. On systems where NVLink only connects GPU pairs (e.g., 0-1 and 2-3 are NVLink-connected but 1-2 communicate via PCIe), use 2-way TP within NVLink pairs and 2-way or 4-way PP across pairs.

Single node, GPUs connected only via PCIe: Prefer PP over TP. If the model fits on a single GPU with quantization, run it on one GPU and use the others for additional model replicas to increase throughput. If multi-GPU is required for memory, use PP with micro-batching to fill the pipeline.

Multi-node cluster: Use PP across nodes by default. Within each node, choose based on the intra-node interconnect (NVLink: TP; PCIe: PP or single-GPU replicas). The latency and bandwidth limitations of cross-node communication make TP across nodes impractical for all but the highest-end InfiniBand configurations.
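The guidelines above can be condensed into a small decision helper. This is a sketch of the rules of thumb only, with invented names: `nvlink_domain` is assumed to be the number of GPUs that can reach each other over NVLink (1 meaning PCIe only), and the model is assumed to need every GPU listed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    tp: int    # tensor-parallel degree
    pp: int    # pipeline-parallel degree
    note: str  # short rationale


def recommend(num_nodes: int, gpus_per_node: int, nvlink_domain: int) -> Plan:
    """Pick TP within the NVLink domain and PP across everything slower."""
    if nvlink_domain < 1 or gpus_per_node % nvlink_domain:
        raise ValueError("gpus_per_node must be a multiple of nvlink_domain")
    tp = nvlink_domain
    pp = num_nodes * (gpus_per_node // nvlink_domain)
    if nvlink_domain == 1:
        note = "PCIe only: prefer single-GPU replicas if the model fits, otherwise PP"
    elif pp == 1:
        note = "TP across the NVLink domain"
    else:
        note = "TP inside each NVLink domain, PP across domains and nodes"
    return Plan(tp, pp, note)


print(recommend(num_nodes=1, gpus_per_node=2, nvlink_domain=2))
print(recommend(num_nodes=1, gpus_per_node=4, nvlink_domain=2))
print(recommend(num_nodes=1, gpus_per_node=4, nvlink_domain=1))
print(recommend(num_nodes=4, gpus_per_node=4, nvlink_domain=4))
```

Treat the output as a starting point to benchmark, not a final answer; workload concurrency and latency targets can justify a different split.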

Measuring and optimizing parallelism performance

After choosing a parallelism strategy, instrument your serving stack to measure actual performance and identify bottlenecks. The key metrics to track are time-to-first-token (TTFT), inter-token latency (ITL), throughput (tokens per second across all concurrent requests), and GPU utilization per device.

For TP configurations, monitor the time spent in all-reduce operations using your inference framework's profiling tools. vLLM, for example, can produce PyTorch profiler traces in which the NCCL communication kernels appear alongside the compute kernels. If communication time exceeds 20% of total per-token time, your interconnect is the bottleneck. Solutions include reducing the TP degree (using PP for the remaining parallelism), enabling communication-computation overlap if your framework supports it, or upgrading to a faster interconnect.

For PP configurations, monitor the pipeline bubble ratio: the fraction of time each GPU spends idle, either waiting for input from the previous stage or waiting for the next stage to accept its output. A high bubble ratio means you need more concurrent requests to fill the pipeline. As a rule of thumb, the minimum number of in-flight requests needed to saturate the pipeline is approximately equal to the number of pipeline stages. With 4-way PP, you need at least 4 concurrent requests to keep all stages busy during the decode phase.
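Both the measured and the ideal bubble ratio are simple to compute. The sketch below uses hypothetical function names; the ideal figure assumes each request is its own micro-batch, all stages take equal time, and transfers are free, so it is a lower bound on what you will observe.

```python
def measured_bubble_ratio(busy_seconds: float, wall_seconds: float) -> float:
    """Idle fraction of one GPU, from its busy time over a measurement window."""
    return 1 - busy_seconds / wall_seconds


def ideal_decode_bubble_ratio(concurrent_requests: int, num_stages: int) -> float:
    """Best-case idle fraction during steady-state decode."""
    # A request occupies one stage at a time, so at most this many stages are busy.
    busy_stages = min(concurrent_requests, num_stages)
    return 1 - busy_stages / num_stages


for requests in (1, 2, 4, 8):
    ratio = ideal_decode_bubble_ratio(requests, num_stages=4)
    print(f"{requests} concurrent request(s), 4 stages: bubble ratio {ratio:.2f}")
```

If the measured ratio stays well above the ideal one at the same concurrency, the cause is more likely stage imbalance or transfer latency than a lack of requests.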

Compare your measured throughput against the theoretical maximum. For TP, the theoretical throughput with N GPUs is N times single-GPU throughput minus communication overhead. For PP with a full pipeline, the theoretical throughput also approaches N times that of a single GPU running the whole model, because each of the N stages holds 1/N of the layers and all stages work on different requests at once; single-request latency, however, stays close to the single-GPU figure plus the inter-stage transfer time. If measured throughput is below 70% of theoretical, investigate whether the bottleneck is communication, memory bandwidth, or load imbalance between stages.
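The 70% check can be automated. This is a minimal sketch with invented names that takes N times single-GPU throughput as the ideal ceiling, the bound given above for TP; the throughput numbers in the example are made up.

```python
def scaling_efficiency(measured_tokens_per_s: float,
                       single_gpu_tokens_per_s: float,
                       num_gpus: int) -> float:
    """Measured throughput as a fraction of the ideal N-times-single-GPU ceiling."""
    return measured_tokens_per_s / (single_gpu_tokens_per_s * num_gpus)


def needs_investigation(measured_tokens_per_s: float,
                        single_gpu_tokens_per_s: float,
                        num_gpus: int,
                        threshold: float = 0.70) -> bool:
    """True when efficiency falls below the threshold suggested in the text."""
    return scaling_efficiency(measured_tokens_per_s,
                              single_gpu_tokens_per_s, num_gpus) < threshold


# Hypothetical measurements: 1000 tokens/s on one GPU, 2600 tokens/s on four.
print(f"efficiency: {scaling_efficiency(2600, 1000, 4):.0%}")
print("investigate:", needs_investigation(2600, 1000, 4))
```

When the model does not fit on one GPU, the single-GPU baseline has to be estimated, for example from a smaller model of the same architecture, which makes the resulting efficiency figure approximate.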

Load imbalance between PP stages is a common issue. Transformer models are not perfectly uniform: the embedding layer and the language model head have different computational profiles from the transformer blocks. The first stage carries the embedding layer in addition to its transformer blocks, and the last stage carries the LM head, so adjust the number of blocks per stage (for example, giving the last stage slightly fewer blocks) until total computation time per stage is approximately equal. Most inference frameworks handle this automatically, but verify by comparing per-stage processing times.
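Balancing stages is a contiguous partitioning problem. The sketch below is one simple way to solve it, not what any particular framework does: given a measured cost per layer (integers, for example microseconds; the values here are invented), it finds the contiguous split that minimizes the slowest stage by trying every possible segment sum as a limit and filling stages greedily.

```python
def balance_stages(costs: list[int], num_stages: int) -> list[list[int]]:
    """Split layer indices into at most num_stages contiguous groups,
    minimizing the total cost of the slowest group."""
    if num_stages < 1 or not costs:
        raise ValueError("need at least one stage and one layer")

    def greedy_groups(limit: int) -> list[list[int]]:
        """Pack layers left to right, starting a new group when `limit` would be exceeded."""
        groups, current, load = [], [], 0
        for index, cost in enumerate(costs):
            if current and load + cost > limit:
                groups.append(current)
                current, load = [], 0
            current.append(index)
            load += cost
        groups.append(current)
        return groups

    # The optimal limit is always the sum of some contiguous run of layers.
    prefix = [0]
    for cost in costs:
        prefix.append(prefix[-1] + cost)
    candidates = sorted({prefix[j] - prefix[i]
                         for i in range(len(costs))
                         for j in range(i + 1, len(prefix))})

    for limit in candidates:
        if limit >= max(costs):
            groups = greedy_groups(limit)
            if len(groups) <= num_stages:
                return groups
    raise AssertionError("unreachable: the total cost always fits in one stage")


# Assumed costs: embedding = 1, nine transformer blocks = 10 each, LM head = 8.
costs = [1] + [10] * 9 + [8]
for stage, layers in enumerate(balance_stages(costs, 2)):
    print(f"stage {stage}: layers {layers}, cost {sum(costs[i] for i in layers)}")
```

With these costs the first stage gets the embedding plus five blocks (cost 51) and the last stage gets four blocks plus the LM head (cost 48), which is the kind of uneven block count described above.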

Making the right choice for your deployment

The decision between TP, PP, and hybrid parallelism reduces to three questions. First, what is your interconnect topology? NVLink enables TP; PCIe and network interconnects favor PP. Second, what is your latency requirement? Interactive applications need TP for low per-request latency; batch processing can tolerate PP's pipeline latency. Third, what is your concurrency level? PP needs concurrent requests to fill the pipeline; if you consistently serve one request at a time, PP provides no throughput benefit.

For most enterprise on-premises deployments running models in the 30B-70B parameter range on multi-GPU servers with NVLink, tensor parallelism across all GPUs in a single node is the default choice. It provides the best latency, scales well up to 8 GPUs with modern NVLink configurations, and is well-supported by major inference frameworks including vLLM, TGI, and TensorRT-LLM.

When scaling beyond a single node, add pipeline parallelism across nodes while maintaining TP within each node. This hybrid approach matches the communication strategy to the available bandwidth at each level of the hardware hierarchy, maximizing throughput while keeping per-request latency low.

Document your parallelism configuration as part of your model deployment specification. Include the TP degree, PP degree, GPU-to-stage mapping, and the hardware topology assumptions. When hardware changes (adding GPUs, upgrading interconnects, or migrating to new servers), revisit the parallelism configuration. A strategy optimized for one hardware configuration can be suboptimal on another, and the performance difference between correct and incorrect parallelism choices is substantial.
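One way to keep this documentation honest is to store it as a small, validated data structure next to the deployment. The sketch below is purely illustrative: the class, its field names, and the example model name are assumptions, not a schema used by any inference framework.

```python
from dataclasses import dataclass, field


@dataclass
class ParallelismSpec:
    model: str
    tp_degree: int
    pp_degree: int
    gpu_to_stage: dict[int, int]  # global GPU index -> pipeline stage
    topology_assumptions: list[str] = field(default_factory=list)

    def validate(self) -> None:
        """Check that the mapping is consistent with the declared degrees."""
        expected_gpus = self.tp_degree * self.pp_degree
        if len(self.gpu_to_stage) != expected_gpus:
            raise ValueError(f"expected {expected_gpus} GPUs, got {len(self.gpu_to_stage)}")
        for stage in range(self.pp_degree):
            members = [g for g, s in self.gpu_to_stage.items() if s == stage]
            if len(members) != self.tp_degree:
                raise ValueError(f"stage {stage} has {len(members)} GPUs, "
                                 f"expected {self.tp_degree}")


spec = ParallelismSpec(
    model="example-70b",  # hypothetical model name
    tp_degree=4,
    pp_degree=2,
    gpu_to_stage={gpu: gpu // 4 for gpu in range(8)},
    topology_assumptions=["4 GPUs per node, fully NVLink-connected",
                          "nodes linked by InfiniBand"],
)
spec.validate()
print(spec)
```

Running the validation in a deployment pipeline catches the case where hardware changes but the recorded configuration does not.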

Featured image by Bill Fairs on Unsplash.


Multi-GPU Inference Parallelism: Tensor vs Pipeline Splitting On-Premises