CommBench: Can LLMs Write Correct and Efficient GPU Communication Code?

By: Shuang Ma, Yuyi Li, Yihan Zhang, Danyang Chen, Shuyang Ji, Ziming Mao, Cheng Ji, Ansha Prashanth, Wenting Yang, Yiran Wang, Chihan Cui, Peiyu Lin, Amanda Raybuck, Ion Stoica, Yang Zhou.
Date: June 9, 2026

GPU communication is a critical component of large-scale LLM training and inference, yet its complexity makes it challenging for code-generation models. We present CommBench, a benchmark with 100+ GPU communication problems + reference solutions (collectively called examples) that cover industry-level multi-device communication use cases based on UCCL's development experience. CommBench spans point-to-point, collective, expert-parallel, compute and communication fusion, and utility functions. These examples are either hand-written by GPU communication experts or distilled from production codebases such as Mscclpp, NCCL, NVSHMEM, DeepEP, ThunderKittens, vLLM, and SGLang. We then evaluate leading closed and open models under a cheat-resistant harness on real hardware spanning intra-node NVLink and inter-node RDMA, and present case studies of where and why they succeed or break down. As future work, we plan to post-train LLMs on these datasets to close this gap.

CommBench open-source: uccl-project/CommBench (MIT license).

Why Writing GPU Communication Code Matters—and Why It Remains Challenging for LLMs?

Communication and compute-communication fusion are essential for scaling modern LLM training and inference. In production training, communication can consume 43.6% of the forward pass ¹; in MoE inference with wide expert parallelism, inter-device communication accounts for up to 47% of total execution time ². Getting this code right and fast is not a nice-to-have.

The demand for customized GPU communication and compute-communication fusion is rapidly growing. Established libraries like NCCL offer comprehensive interfaces, but optimize for generality over frontier performance. As a result, companies often maintain in-house GPU communication and computation stacks for tighter control and optimization. GPU communication also remains a rapidly evolving area: new hardware and new LLM architectures continuously introduce new requirements for higher performance and specialized workloads, while communication abstractions are still evolving:

Modern GPUs are extremely powerful and expensive, motivating highly customized kernels and tighter compute–communication fusion to maximize hardware utilization across architectures such as Hopper, Blackwell, and AMD GPUs. As GPUs become faster, communication increasingly needs to be initiated directly from GPUs instead of relying on CPU-mediated execution paths used in traditional libraries such as NCCL.
New LLM architectures, such as MoE expert parallelism, introduce increasingly irregular and fine-grained communication patterns that are not well supported by existing collective libraries.

Multi-device GPU programming is inherently harder than single-device coding, for three reasons:

It demands niche expertise, requiring deep knowledge of both GPU kernels and networking.
It requires coordinating many devices over fail-prone interconnects, which is intrinsically difficult.
It lacks data, as practical, faithful datasets for GPU communication are largely missing.

Despite all this, multi-device GPU programming has been largely overlooked in LLM coding benchmarks. HumanEval, MBPP, LiveCodeBench — these measure single-device reasoning. No existing benchmark evaluates whether a model can generate correct GPU communication code, including communication primitives (e.g., Mscclpp channels and collective interfaces) and compute–communication fusion kernels (e.g., fused AllGather+GEMM across NVLink and InfiniBand).

Benchmark and Framework Structure

Benchmark Structure

The dataset is organized as a list of independently runnable examples, currently with 100+ such examples. Some implement complete, ready-to-use functionality (e.g., P2P/collective interfaces, or MoE expert-parallel dispatch and combine); others are reusable communication building blocks (e.g., Mscclpp channels). Drawing on our hands-on experience from the UCCL project, we manually assign each example one of three difficulty levels: Easy / Medium / Hard.

Some examples we hand-wrote on top of base libraries; others are extracted from production-grade communication and LLM-serving frameworks. By function, they fall into:

P2P — point-to-point transfer between a pair of devices.
Collective — group operations across all ranks (AllReduce, AllGather, All-to-All, …).
EP — dynamic, non-uniform dispatch/combine traffic for MoE models.
Fusion — kernels that interleave communication with compute (e.g., AllGather+GEMM).
Utilities — supporting components such as connection setup, buffer registration, and topology queries.

By source, they span cuda-runtime, libibverbs, Mscclpp, nccl, nvshmem, deepep, nccl-device-api, thunderkittens, vllm, and sglang.

Framework Structure

We built a framework that automatically evaluates different models on the dataset and supports multi-round refinement of generated code. Careful design of each example and its execution environment enforces strict checking and prevents the model from cheating.

Example structure. Each example has three parts:

Reference solution (excluded from the repo) — high-quality, hand-written code organized in an object-oriented style (Python / CUDA / HIP / C++). It ships with a test harness that uses randomized inputs, so a model cannot hard-code expected outputs. We keep the reference solutions secret to prevent them from leaking into model training data and contaminating future evaluations.
Problem statement + empty solution — the reference with its core implementation stripped out and marked // TODO, but with the functional interface preserved. This file becomes part of the prompt: the model must honor the interface semantics, which also keeps its output directly testable. We verify that only the regions meant to be edited were changed, guarding against cheating.
Build-and-run script — independently compiles and runs an example (reference or generated) and is hidden from the model. By controlling the compile command, we strictly ensure the generated code uses only the intended libraries.

Multi-round refinement. When max-round > 1, if the generated code fails to run or underperforms the reference, we feed the compile/run output back into the prompt and ask the model to iterate — repeating until performance improves or the round limit is reached.

Testbed. We evaluate on three clusters that together span NVLink, InfiniBand, and RoCE on both NVIDIA and AMD GPUs: (1) a single node of 8× NVIDIA B300 over NVLink; (2) NVIDIA GH200 nodes linked by 400 Gb/s InfiniBand (ConnectX-7 / BlueField-3) for inter-node RDMA; and (3) 8× AMD Instinct MI325X nodes with intra-node Infinity Fabric (XGMI) and 400 Gb/s RoCE inter-node RDMA. This lets CommBench exercise real inter-node RDMA paths, not only intra-node NVLink, across both vendors’ stacks (CUDA and ROCm/HIP).

Leaderboard

Sorted by Pass×GM ⭐ — pass rate scaled by geometric-mean code quality on passing examples. Bars ▓░ show each value on an absolute 0–100% scale (Pass×GM and GM‑Speedup as a fraction of 1.0).

Rank	Model	Pass×GM	Pass Rate	PASS+Good	GM‑Speedup	Open Source	Price
🥇	gpt-5.5	🟢 0.467 `▓▓▓▓░░░░`	🟢 57.4% `▓▓▓▓▓░░░`	🟢 30.7% `▓▓░░░░░░`	🔴 0.813 `▓▓▓▓▓▓▓░`	❌	$1.91
🥈	gemini-3.1-pro-preview	🟡 0.305 `▓▓░░░░░░`	🟡 36.6% `▓▓▓░░░░░`	🟢 25.7% `▓▓░░░░░░`	🔴 0.832 `▓▓▓▓▓▓▓░`	❌	$0.26
🥉	claude-opus-4-7	🟡 0.282 `▓▓░░░░░░`	🟡 33.7% `▓▓▓░░░░░`	🟡 20.8% `▓▓░░░░░░`	🔴 0.836 `▓▓▓▓▓▓▓░`	❌	$0.21
4️⃣	glm-5.1	🟡 0.281 `▓▓░░░░░░`	🟡 29.7% `▓▓░░░░░░`	🟡 17.8% `▓░░░░░░░`	🟢 0.947 `▓▓▓▓▓▓▓▓`	✅	$0.63
5️⃣	kimi-k2.6	🟡 0.275 `▓▓░░░░░░`	🟡 30.7% `▓▓░░░░░░`	🟡 18.8% `▓▓░░░░░░`	🟡 0.895 `▓▓▓▓▓▓▓░`	✅	$0.10
6️⃣	qwen3.7-max	🟡 0.269 `▓▓░░░░░░`	🟡 26.7% `▓▓░░░░░░`	🟡 15.8% `▓░░░░░░░`	🟢 1.008 `▓▓▓▓▓▓▓▓`	❌	$0.03
7️⃣	deepseek-v4-pro	🔴 0.197 `▓▓░░░░░░`	🔴 19.8% `▓▓░░░░░░`	🔴 12.9% `▓░░░░░░░`	🟢 0.995 `▓▓▓▓▓▓▓▓`	✅	$0.02

Color: 🟢 top tier · 🟡 mid tier · 🔴 bottom tier.

Metric Definitions

Metric	Formula	What it measures
Pass×GM ⭐	Pass Rate × GM‑Speedup	Pass rate scaled by geometric-mean code quality on passing examples. Combines coverage and quality: a model that rarely passes but generates very fast code scores the same as one that always passes at reference speed. Primary ranking metric.
Pass Rate	PASS / Total	Fraction of examples where code compiled, ran, and produced correct results.
PASS+Good	(on_compare + better) / Total	Fraction of all examples with correct and performant code (within −5% of reference).
GM‑Speedup	`GM` over scored PASS of `scoreᵢ`	Geometric mean of per-example speedup scores over passing examples. The per-example score is `scoreᵢ = GMₛ( gen(i,s) / ref(i,s) )`, where `s` iterates over measured data sizes and `gen(i,s)`, `ref(i,s)` are the performance metrics of the generated and reference code on example `i` at data size `s`, expressed so that higher is better (lower-is-better metrics such as latency are inverted to `1/latency` first). The choice of primary metric per example is determined by human review. Taking the GM across sizes (rather than averaging absolute throughput first) ensures each data point contributes equally regardless of absolute throughput magnitude; without it, large sizes (e.g. 128 MB at ~100 GB/s) would dominate small ones (e.g. 256 KB at ~5 GB/s). Caveat: computed only over passing examples. A model that passes more (often harder) examples with mediocre performance can show a worse GM‑Speedup, hence we rank by Pass×GM.
Price	—	Average cost per example (USD).

Performance verdict thresholds (4-tier): better ≥ +20% · on_compare −5% to +20% · degraded −40% to −5% · severely_degraded < −40%

Reasoning effort and model parameters. We run every model at its default reasoning effort and default sampling parameters (temperature, top-p, etc.); we do not tune them. The per-model defaults are: gpt-5.5 = medium, gemini-3.1-pro-preview = high, claude-opus-4-7 = adaptive thinking, glm-5.1 = thinking enabled (the model decides when to think), kimi-k2.6 = thinking enabled (temperature fixed at 1.0), deepseek-v4-pro = medium. The only exception is qwen3.7-max, where thinking is disabled to avoid timeouts.

Analysis: Top vs. Bottom Model

We select the highest- and lowest-scoring models on CommBench, gpt-5.5 (Pass×GM = 0.467) and deepseek-v4-pro (Pass×GM = 0.197), for a detailed breakdown across difficulty levels, task types, library coverage, and code performance.

Difficulty Breakdown

GPT-5.5	DeepSeek-V4-Pro

Human-defined difficulty does not always align with model difficulty. deepseek’s weak Easy performance (14%) is largely attributable to gaps in library-specific API knowledge — for libraries such as mscclpp and ThunderKittens, it lacks the interface semantics needed to call APIs correctly, causing failures on examples that only require invoking a small number of functions to implement straightforward functionality. On the Hard end, many examples ask the model to implement communication primitives from near-scratch (e.g. AllReduce over NVLink or RDMA) which exposes deepseek’s limited capability on complex GPU communication-compute tasks.

Performance Quality Among PASS Examples

GPT-5.5	DeepSeek-V4-Pro

gpt-5.5 has the widest quality spread: 8 severely degraded cases (14% of PASS), all concentrated in Hard examples. This is partly because examples that gpt-5.5 manages to compile and run (but with degraded performance) would simply fail to compile or execute under deepseek, and therefore never appear in deepseek’s performance distribution at all.

Tag and Library Coverage

GPT-5.5	DeepSeek-V4-Pro

gpt-5.5 dominates on Collective (66% vs 22%) and is the only model with meaningful coverage of specialized libraries — mscclpp (17%), nccl-device-api (40%), and thunderkittens (54%). deepseek is competitive on NCCL (100%) and P2P tags, but scores 0% across all three specialized libraries. For widely adopted libraries such as vllm, NCCL, and nvshmem, both models perform well.

DeepSeek with max_rounds = 5

Due to budget constraints, gpt-5.5 was evaluated with a single generation round. Despite deepseek’s weaker first-round performance, its API cost is only ~1% of gpt-5.5’s ($0.02 vs $1.91 per example), which makes multi-round self-correction economically viable. Allowing deepseek up to 5 self-correction rounds substantially changes its profile.

Cumulative PASS by Round

Round	Cumulative PASS	Pass Rate	New this round
1	16	15.8%
2	28	27.7%	+12 (largest gain)
3	32	31.7%	+4
4	34	33.7%	+2
5	42	41.6%	+8

Difficulty Breakdown: max=1 vs max=5

DeepSeek max=1	DeepSeek max=5

Performance Quality and Library Coverage (max=5)

Performance distribution among PASS examples	Tag & library coverage

Multi-round self-correction more than doubles deepseek’s overall pass rate (16→42, 15.8%→41.6%) and substantially unlocks Medium-difficulty commodity-library tasks (31% → 60%). It does not unlock Hard examples or specialized libraries (mscclpp: 0%, thunderkittens: 4%) — those require domain knowledge the model does not have. The practical implication: deepseek with retries is a reasonable choice when tasks are restricted to commodity libraries (NCCL, vllm, cuda-runtime, nvshmem) and a retry budget is acceptable.

Case Studies

Case 1: Partial Pass — ThunderKittens AllToAll

Example 049 — ThunderKittens AllToAll ↗

Level 🟢 Easy · Tag Collective · Library thunderkittens

Task: Implement a BF16 multi-GPU AllToAll kernel using the ThunderKittens library, using TMA (Tensor Memory Accelerator) for tile-level data movement between GPU shards over NVLink.

What to implement: Two small device-side kernels: all_to_all::kernel (TMA-load one tile from local input, TMA-store to destination device’s output shard) and all_to_all_barrier::kernel (one-line barrier_all synchronization). All host scaffolding is provided.

Why most models failed: Both kernels are under 10 lines and only require calling existing ThunderKittens interfaces, with no custom algorithm needed. The failure root is ThunderKittens-specific API knowledge: type scoping (all_to_all::globals::shared_tile vs bare shared_tile), the correct tma::load_async argument signature, and the shared_allocator::allocate template usage. ThunderKittens is a niche research library with minimal training-data coverage, causing models to mis-scope types, hallucinate non-existent APIs, or use wrong argument forms.

Per-Model Summary

Model	Rounds	Outcome	Perf vs Ref	Failure Pattern
gpt-5.5	1	PASS	on_par (−0.23%)	—
claude-opus-4-7	1	PASS	on_par (+0.17%)	—
deepseek-v4-pro	5	Compile error (r1–r2, r4–r5) / Link error (r3)	—	r1: hallucinated ThunderKittens API (`tma::semaphore`, `tma::init_semaphore`, `tma::wait`, wrong `barrier_all` form — none exist); r2: structural C++ errors (`__global__` written as member function, duplicate class definition, wrong CUDA driver API types); r3: compiled but no `main()` → linker `undefined reference to 'main'`; r4: used `CUDA_CHECK` macro without defining it; r5: early type/template error triggered 43 cascading STL errors
gemini-3.1-pro-preview	1	Compile error	—	`shared_tile` used without `all_to_all::globals::` namespace qualifier
kimi-k2.6	1	Compile error	—	Same unqualified `shared_tile` reference
glm-5.1	1	Compile error	—	Wrong type passed to `shared_allocator::allocate`
qwen3.7-max	1	Compile error	—	Wrong argument types for `tma::load_async`

Case 2: A Case All LLMs Failed

Example 072 — GPU Barrier Within CTA ↗

Level 🟡 Medium · Tag Utilities · Library cuda-runtime

Task: Implement intra-CTA producer/consumer synchronization using a Hopper/Blackwell shared-memory mbarrier with a non-blocking try_wait probe, a pattern used in persistent-kernel tile pipelines (e.g., CUTLASS, Mirage) to overlap data arrival with computation without stalling the warp scheduler.

What to implement: Three inline-PTX device helpers (initialize_barrier, arrive, try_wait_barrier) and a benchmark kernel body, lifted from Mirage.

Why all models failed: This task only requires using a small set of well-defined mbarrier interfaces, which is the kind of API memorization LLMs are expected to excel at. However, mbarrier is Hopper-specific PTX introduced in sm_90, and models may lack sufficient training data for these instructions, making them prone to mis-remembering argument counts, ordering, or address-space requirements.

Per-Model Failure Summary

Model	Rounds	Outcome	Failure Pattern
deepseek-v4-pro	5	Compile error	Wrong PTX register constraint (r1); missing `mbarrier.arrive` destination operand (r2–r5)
gpt-5.5	1	Deadlock	Deadlock: consumer runs before producers due to warp branch ordering; no `kProbeBound` guard so kernel hangs forever
gemini-3.1-pro-preview	1	Correctness FAIL	Deadlock: consumer runs before producers; exits via `kProbeBound` and reads uninitialized tile
claude-opus-4-7	1	Correctness FAIL	Deadlock: consumer runs before producers; also incorrect `phase ^= 1` since `mbarrier.init` resets parity to 0 each iteration
glm-5.1	1	Runtime error	Inline-asm operand index confusion
kimi-k2.6	1	Compile error	Hallucinated CUDA intrinsic name
qwen3.7-max	1	Correctness FAIL	Wrong PTX register constraint (`"l"` instead of `"r"`)

Case 3: All LLMs Failed — Niche Library API with No Algorithmic Hints

Example 086 — mscclpp AllToAll (Hard) ↗

Level 🔴 Hard · Tag Collective · Library mscclpp

Task: Implement the fastest intra-node AllToAll kernel using MSCCL++ ³ MemoryChannel primitives. The template provides no algorithmic description — only a minimal comment “implement the fastest All to All CUDA kernel” and a set of empty // TODO stubs.

What to implement: One GPU kernel (alltoall2) using MemoryChannel for direct peer writes, and the full host-side All2All class (constructor, buffer allocation, channel setup, launch, correctness verification, barrier).

Why all models failed: The task requires models to implement a functionally complete, semantically correct intra-node AllToAll from scratch (device kernel, host class, channel setup, buffer management, and correctness verification), all consistent with MSCCL++‘s MemoryChannel interface contract, with no algorithmic hints provided. This is hard even for an expert programmer unfamiliar with mscclpp. Every model failed at compile time: mscclpp headers and class interfaces are largely absent from training data, causing models to hallucinate non-existent header paths and method signatures before reaching any algorithmic logic.

Per-Model Failure Summary

Model	Rounds	Outcome	Failure Pattern
deepseek-v4-pro	5	Compile error	r1: used mscclpp types with no includes (namespace undefined); r2–r5: each round invented a different non-existent mscclpp header
gpt-5.5	1	Compile error	Included some mscclpp headers but missed the one for `MemoryChannel`; also hallucinated a wrong class name
gemini-3.1-pro-preview	1	Compile error	Hallucinated mscclpp API
claude-opus-4-7	1	Compile error	Hallucinated mscclpp API
glm-5.1	1	Compile error	Hallucinated mscclpp API
kimi-k2.6	1	Compile error	Hallucinated mscclpp API
qwen3.7-max	1	Compile error	Hallucinated mscclpp API

Conclusion

CommBench shows that multi-device GPU communication remains a clear blind spot for today’s frontier LLMs. Even the strongest model passes under 60% of examples and produces performant code on only a third, while every model collapses to near-zero coverage on specialized libraries such as Mscclpp, ThunderKittens, and the NCCL device API. The failures are not superficial: models hallucinate APIs, misplace synchronization, and ship kernels orders of magnitude slower than reference — gaps that compiler feedback and multi-round retries alone cannot close. Multi-round self-correction helps only on commodity libraries and easier tasks; closing the remaining gap will require targeted post-training on the kind of data CommBench provides. We open-source the dataset and harness to support that effort.

Acknowledgements

We thank Mibura and AMD for sponsoring the testbed for this benchmark.

References

Chao Jin et al. MegaScale-MoE: Large-Scale Communication-Efficient Training of Mixture-of-Experts Models in Production. EuroSys, 2026. ↩
Shulai Zhang et al. Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts. MLSys, 2025. ↩
Changho Hwang et al. MSCCL++: Rethinking GPU Communication Abstractions for AI Inference. ASPLOS, 2026. ↩

LLMs Benchmark Code Generation GPU Communication NCCL RDMA CUDA MSCCLPP

Publish on 2026-06-09 Enjoy the blog? Give UCCL a star

CommBench: Can LLMs Write Correct and Efficient GPU Communication Code?

Why Writing GPU Communication Code Matters—and Why It Remains Challenging for LLMs?

Benchmark and Framework Structure

Benchmark Structure

Framework Structure

Leaderboard

Metric Definitions

Analysis: Top vs. Bottom Model

Difficulty Breakdown

Performance Quality Among PASS Examples

Tag and Library Coverage

DeepSeek with max_rounds = 5

Cumulative PASS by Round

Difficulty Breakdown: max=1 vs max=5

Performance Quality and Library Coverage (max=5)

Case Studies

Case 1: Partial Pass — ThunderKittens AllToAll

Example 049 — ThunderKittens AllToAll ↗

Per-Model Summary

Case 2: A Case All LLMs Failed

Example 072 — GPU Barrier Within CTA ↗

Per-Model Failure Summary

Case 3: All LLMs Failed — Niche Library API with No Algorithmic Hints

Example 086 — mscclpp AllToAll (Hard) ↗

Per-Model Failure Summary

Conclusion

Acknowledgements

References

rdmatop: Cross-Provider htop for RDMA Traffic

Reading OpenAI's MRC Through a UCCL Lens

mKernel: Fast Multi-GPU, Multi-Node Fused Kernels