UCCL

RDMA EFA InfiniBand RoCE NCCL NVSHMEM Monitoring 2026-06-15

rdmatop: Cross-Provider htop for RDMA Traffic

rdmatop is htop for RDMA traffic: a real-time TUI for monitoring RDMA NICs in multi-node LLM training and inference, surfacing bottlenecks NCCL and NVSHMEM hide.

LLMs Benchmark Code Generation GPU Communication NCCL RDMA CUDA MSCCLPP 2026-06-09

CommBench: Can LLMs Write Correct and Efficient GPU Communication Code?

CommBench evaluates how effectively frontier LLMs generate multi-device GPU communication code, covering diverse communication functionalities and compute–communication fusion kernels.

MRC SRv6 SRD Multi-plane UCCL-Tran 2026-05-26

Reading OpenAI's MRC Through a UCCL Lens

A close reading of OpenAI/Microsoft/AMD/Broadcom/NVIDIA's MRC + SRv6 paper and the MRC OCP specification, viewed from the UCCL perspective: a real step forward over traditional RoCE v2 RC, but with co...

Fused Kernels RDMA 2026-05-25

mKernel: Fast Multi-GPU, Multi-Node Fused Kernels

mKernel is a collection of multi-GPU, multi-node fused kernels that fuse intra-node NVLink communication, inter-node RDMA, and compute into a single persistent kernel.

LLM Training Fault Tolerance Systems OSDI 2026-05-18

When Your AI Training Cluster Crashes at 3 AM: How TrainMover Cuts Recovery Time to 20 Seconds

An educational deep-dive into interruption resilience for large-scale LLM training, and how TrainMover (OSDI '26) reduces downtime to ~20 seconds with zero memory overhead.

RDMA EFA 2026-04-13

A Practitioner Guide to AWS EFA Programming

Programming AWS EFA NICs for efficient GPU communication.

MoE DeepEP RDMA Expert Parallelism AMD EFA 2026-04-06

UCCL-EP: Portable Expert-Parallel Communication — Full Results

Full evaluation of UCCL-EP across NVIDIA and AMD GPUs, AWS EFA, InfiniBand, and Broadcom NICs — with application-level results on SGLang, vLLM inference and Megatron-LM training.

MoE DeepEP IBGDA RDMA 2025-10-27

Previewing UCCL-EP: Flexible and Efficient Expert Parallelism for Cloud and Beyond

GPU-driven communication (e.g., DeepEP) is the key to efficient and large-scale EP, but it cannot run on heterogeneous platforms in the public cloud due to tight coupling between GPU and NIC.

NIXL NCCL RCCL Mooncake RDMA 2025-08-13

Everything You Want to Know about KV Cache Transfer Engine

Many KV cache transfer engines exist for PD disaggregation, but there are almost no benchmarks of their performance. This post fills that gap by benchmarking and analyzing the performance of v...

NCCL RCCL RDMA 2025-06-30

How to Debug NCCL Performance Issues for ML Workloads?

NCCL is notoriously hard to debug. In this post, we walk through our journey of debugging NCCL performance issues and show how UCCL can help with this process.

Networking AI RDMA 2025-05-26

UCCL-Tran: An Extensible Software Transport Layer for GPU Networking

UCCL-Tran is designed to be fast and extensible to meet the challenging requirements of modern ML/LLM workloads.

Networking AI Sky Computing 2025-05-26

About Us

About the UCCL team