UCCL-EP: Portable Expert-Parallel Communication — Full Results


By: Ziming Mao (UC Berkeley), ChonLam Lao (Harvard), Yang Zhou (UC Davis), Yihan Zhang (UC Davis), Chihan Cui (UW-Madison), Zhongjie Chen (Tsinghua), Zhiying Xu (AWS), and other UCCL-EP contributors
Date: Dec 20, 2025

We present UCCL-EP, a portable expert-parallel communication library that achieves state-of-the-art performance across heterogeneous GPU and NIC hardware. UCCL-EP outperforms the best existing EP solution by up to 2.3x for dispatch and combine on AWS EFA. For end-to-end applications, UCCL-EP improves Megatron-LM training throughput by up to 45% over RCCL on 128 AMD GPUs and speeds up SGLang inference throughput by up to 40% over NCCL on 32 H200 GPUs.

Paper: arxiv.org/pdf/2512.19849 | Code: uccl-project/uccl/ep

Recap: Why UCCL-EP?

In our previous blog post, we introduced UCCL-EP and the key challenge it addresses: state-of-the-art EP communication systems like DeepEP achieve high performance through GPU-initiated token-level RDMA, but are tightly coupled to the NVIDIA GPU + NVIDIA NIC ecosystem (via IBGDA). This tight coupling prevents running on public clouds like AWS (which use EFA NICs), on AMD GPUs, or on alternative NIC vendors like Broadcom.

UCCL-EP solves this with a clean separation of concerns:

  • GPUs retain fine-grained token-level communication initiation for maximal overlap with computation.
  • CPUs handle the control-intensive networking aspects — queue management, flow control, ordering enforcement — through a lightweight, multi-threaded proxy.

This architecture reduces the porting effort from O(m x n) (left figure, for m GPU vendors and n NIC vendors) down to O(m) (right figure), as the CPU proxy can use the libibverbs API that most RDMA NICs support. In fact, UCCL-EP provides drop-in support for vLLM, SGLang, and Megatron-LM, making it easy to adopt without modifying application or framework code.


Left: IBGDA-style GPU-initiated communication requires O(m x n) porting effort across GPU and NIC vendors. Right: UCCL-EP reduces this to O(m) by using the CPU as a portable intermediary via libibverbs.

UCCL-EP architecture is shown in the figure below. GPUs delegate token-routing commands to multi-threaded CPU proxies via lock-free FIFO channels. CPU proxies issue GPUDirect RDMA on behalf of GPUs, while managing ordering, flow control, and completion handling.

UCCL-EP architecture.

We discuss two key techniques in UCCL-EP (please refer to the paper for more details):

CPU Proxy with Lock-Free FIFO Channels. UCCL-EP uses a 128-bit TransferCmd descriptor that GPU threads enqueue into shared, lock-free FIFO channels. CPU proxy threads dequeue these commands and issue the corresponding RDMA operations. Each GPU uses multiple FIFO channels, each mapped to a dedicated RDMA Queue Pair (QP).
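To make the command path concrete, here is a minimal Python sketch of a 128-bit transfer descriptor and a single-producer/single-consumer ring in the spirit of UCCL-EP's FIFO channels. The field layout and the names `pack_cmd` and `SpscFifo` are illustrative assumptions, not the library's actual ABI; the real descriptors live in GPU/CPU shared memory and are drained by proxy threads.

```python
import struct

# Hypothetical 128-bit layout for illustration; the real TransferCmd fields
# live in the UCCL-EP sources and may differ.
CMD_FMT = "<BBHIQ"  # opcode:8 | channel:8 | dst_rank:16 | length:32 | remote_offset:64

def pack_cmd(opcode, channel, dst_rank, length, remote_offset):
    """Pack one transfer command into 16 bytes (128 bits)."""
    return struct.pack(CMD_FMT, opcode, channel, dst_rank, length, remote_offset)

class SpscFifo:
    """Single-producer/single-consumer ring: the GPU side enqueues commands,
    one CPU proxy thread dequeues them. It is lock-free because `tail` is
    written only by the producer and `head` only by the consumer."""
    def __init__(self, capacity=1024):
        self.slots = [None] * capacity
        self.capacity = capacity
        self.head = 0  # consumer cursor
        self.tail = 0  # producer cursor

    def enqueue(self, cmd):
        if self.tail - self.head == self.capacity:
            return False  # ring full: the GPU warp would spin or back off
        self.slots[self.tail % self.capacity] = cmd
        self.tail += 1  # publish only after the slot is written
        return True

    def dequeue(self):
        if self.head == self.tail:
            return None  # ring empty
        cmd = self.slots[self.head % self.capacity]
        self.head += 1
        return cmd
```

Mapping each channel to a dedicated QP, as UCCL-EP does, keeps this single-consumer invariant per channel, which is what makes the lock-free design possible.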

Immediate-Data-Based Ordering for Heterogeneous NICs. Different NICs provide different delivery guarantees. For example, AWS EFA uses the Scalable Reliable Datagram (SRD) protocol with multi-pathing, which does not guarantee in-order delivery within a single QP. UCCL-EP embeds per-channel sequence numbers into RDMA immediate data, allowing the receiver-side CPU proxy to reorder out-of-sequence control messages before committing them to GPU memory. This approach also enables software-level atomics — piggybacking completion notifications on regular RDMA writes — which is both more portable (no hardware atomic support required) and more efficient (one RDMA operation instead of two for write + atomic).
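The receiver-side reordering logic can be sketched as follows. This is a simplified model, assuming the 32-bit immediate data carries a per-channel sequence number; the real proxy commits payloads into GPU memory rather than a Python list, and the class name is hypothetical.

```python
class ReorderBuffer:
    """Sketch of receiver-side reordering keyed on the per-channel sequence
    number carried in RDMA immediate data."""
    def __init__(self):
        self.expected = 0    # next sequence number to commit
        self.pending = {}    # out-of-order arrivals, keyed by sequence number
        self.committed = []  # in-order commit log (stands in for GPU memory)

    def on_recv(self, imm_seq, payload):
        self.pending[imm_seq] = payload
        # Commit the longest in-order prefix that is now available.
        while self.expected in self.pending:
            self.committed.append(self.pending.pop(self.expected))
            self.expected += 1
```

Because commits happen strictly in sequence order, a completion notification piggybacked on the last write of a batch implies all earlier writes have landed, which is what enables the software-level atomics described above.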

CPU Proxy Throughput Is Not the Bottleneck

We measure the FIFO channel latency and compare it against the end-to-end RDMA latency on EFA and Broadcom NICs. The FIFO channel sustains over 8 million operations per second, with average latency of ~3 us and P99 under 8 us — an order of magnitude lower than the RDMA network latency (which ranges from 20–50+ us on EFA). This confirms that the CPU-GPU channel is not the bottleneck, and the additional proxy latency is amortized across the much larger network transfer time.
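To make "not the bottleneck" concrete, a quick back-of-envelope using the numbers above (a sketch of the arithmetic, not a measurement script):

```python
# Numbers taken from the measurement above: ~3 us average FIFO latency versus
# 20-50+ us end-to-end RDMA latency on EFA.
fifo_us = 3.0
relative_overhead = {rdma_us: fifo_us / rdma_us for rdma_us in (20.0, 50.0)}
# The proxy hop adds ~15% at the low end of the RDMA latency range and ~6% at
# the high end, and proportionally less for large transfers that spend even
# longer on the wire.
```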


Evaluation Testbeds

We have since evaluated UCCL-EP on a diverse set of platforms spanning NVIDIA and AMD GPUs with EFA, InfiniBand, and Broadcom NICs.

| Name | Servers | GPU | NIC | CPU | Cloud |
|---|---|---|---|---|---|
| NV_EFAv3 | 4 | NVIDIA H200 x8 | AWS EFAv3 200G x16 | 192 cores | AWS (p5en) |
| NV_EFAv4 | 4 | NVIDIA B200 x8 | AWS EFAv4 400G x8 | 192 cores | AWS (p6-b200) |
| NV_IB | 4 | NVIDIA H100 x8 | ConnectX-7 400G x8 | 128 cores | Nebius |
| NV_GH200 | 2 | NVIDIA GH200 x1 | ConnectX-7 200G x1 | 72 cores | Lambda |
| AMD_CX7 | 4-16 | AMD MI300X x8 | ConnectX-7 400G x8 | 128 cores | OCI |
| AMD_BRC | 4 | AMD MI300X x8 | Broadcom Thor-2 400G x8 | 128 cores | Vultr |
| AMD_AINIC | 4 | AMD MI355X x8 | Pollara AI NIC IB x8 | 128 cores | AMD |

Baselines: NCCL/RCCL, DeepEP (NVIDIA-only), Perplexity Kernels (PPLX), and CPU-assisted IBGDA. UCCL-EP uses 4 CPU proxy threads per GPU.


Microbenchmark Results

On AWS EFA (NVIDIA GPUs)

H200 + EFA (p5en)

On p5en instances (8x H200, 16x 200 Gb/s EFA), we measure EP32 dispatch and combine latency while varying the number of tokens. UCCL-EP uses the minimum of high-throughput (HT) and low-latency (LL) mode latency, while PPLX operates in a single mode.

At small batch sizes (128 tokens), PPLX achieves lower latency because UCCL-EP (extending DeepEP) issues messages at 7 KB token granularity, whereas PPLX packs tokens into larger messages. However, as the token count increases, UCCL-EP quickly overtakes PPLX — delivering 2.3x lower dispatch latency and 1.1–1.5x lower combine latency for medium and large batches.


EP32 dispatch (left) and combine (right) latency vs. number of tokens on p5en (H200 + EFA 400 Gbps).

We present more results in the table below.

Normal Mode Bandwidth (EFA)

Following the DeepSeek-V3 pretraining configuration (4096 tokens, 7168 hidden, top-4 groups, top-8 experts, FP8 dispatch / BF16 combine):

| Platform | Type | #EP | Dispatch BW & Latency | Combine BW & Latency |
|---|---|---|---|---|
| p5en (H200) | Intranode | 8 | 320 GB/s (NVLink), 500 us | 319 GB/s (NVLink), 973 us |
| p5en (H200) | Internode | 16 | 50 GB/s (RDMA), 1196 us | 18 GB/s (RDMA), 6379 us |
| p5en (H200) | Internode | 24 | 53 GB/s (RDMA), 1633 us | 26 GB/s (RDMA), 6365 us |
| p5en (H200) | Internode | 32 | 54 GB/s (RDMA), 2022 us | 43 GB/s (RDMA), 4899 us |
| p6 (B200) | Intranode | 8 | 280 GB/s (NVLink), 571 us | 426 GB/s (NVLink), 727 us |
| p6 (B200) | Internode | 16 | 53 GB/s (RDMA), 1141 us | 60 GB/s (RDMA), 1965 us |
| p6 (B200) | Internode | 24 | 53 GB/s (RDMA), 1637 us | 59 GB/s (RDMA), 2887 us |
| p6 (B200) | Internode | 32 | 53 GB/s (RDMA), 2072 us | 57 GB/s (RDMA), 3724 us |

Low-Latency Mode (EFA)

Following a DeepSeek-V3 inference setting (128 tokens, 7168 hidden, top-8 experts, FP8 dispatch / BF16 combine):

| Platform | #EP | Dispatch Latency | Dispatch BW | Combine Latency | Combine BW |
|---|---|---|---|---|---|
| p5en (H200) | 16 | 226 us | 36 GB/s | 293 us | 48 GB/s |
| p5en (H200) | 24 | 386 us | 20 GB/s | 580 us | 26 GB/s |
| p5en (H200) | 32 | 465 us | 16 GB/s | 694 us | 25 GB/s |
| p6 (B200) | 16 | 228 us | 33 GB/s | 318 us | 46 GB/s |
| p6 (B200) | 24 | 448 us | 17 GB/s | 566 us | 26 GB/s |
| p6 (B200) | 32 | 406 us | 19 GB/s | 617 us | 24 GB/s |

On InfiniBand (NVIDIA CX7)

On the Nebius testbed (H100 + CX7 InfiniBand), we compare UCCL-EP against both the original DeepEP and PPLX at EP32.

In LL mode, UCCL-EP incurs slightly higher latency than DeepEP and PPLX due to the CPU proxy overhead on small messages. However, in HT mode, UCCL-EP achieves latency within 5% of DeepEP for dispatch while outperforming PPLX by 2.1x (dispatch) and 1.6x (combine). This shows that UCCL-EP preserves DeepEP-level performance on throughput-oriented workloads even without IBGDA.


EP32 dispatch (left) and combine (right) comparison on H100 + CX7 InfiniBand. UCCL-EP matches DeepEP in HT mode and significantly outperforms PPLX.


On AMD GPUs

UCCL-EP enables GPU-initiated token-level EP communication on AMD GPUs. We evaluate on MI300X with both CX7 InfiniBand (OCI) and Broadcom Thor-2 NICs (Vultr).


EP32 dispatch (left) and combine (right) on AMD MI300X with CX7 RoCE and Broadcom Thor-2 NICs.

More results across heterogeneous platforms are shown in the tables below.

Normal Mode Bandwidth (AMD)

| Platform | Type | #EP | FP8 Dispatch BW | BF16 Dispatch BW | Combine BW |
|---|---|---|---|---|---|
| MI300X + CX7 RoCE | Intranode | 8 | 260 GB/s (xGMI) | 295 GB/s (xGMI) | 304 GB/s (xGMI) |
| MI300X + CX7 RoCE | Internode | 16 | 74 GB/s (RDMA) | 82 GB/s (RDMA) | 78 GB/s (RDMA) |
| MI300X + CX7 RoCE | Internode | 32 | 60 GB/s (RDMA) | 61 GB/s (RDMA) | 60 GB/s (RDMA) |
| MI300X + CX7 RoCE | Internode | 64 | 52 GB/s (RDMA) | 53 GB/s (RDMA) | 51 GB/s (RDMA) |
| MI355X + Pollara | Intranode | 8 | 299 GB/s (xGMI) | 336 GB/s (xGMI) | 333 GB/s (xGMI) |
| MI355X + Pollara | Internode | 16 | 82 GB/s (RDMA) | 82 GB/s (RDMA) | 82 GB/s (RDMA) |
| MI355X + Pollara | Internode | 32 | 59 GB/s (RDMA) | 58 GB/s (RDMA) | 59 GB/s (RDMA) |
| MI355X + Pollara | Internode | 64 | 50 GB/s (RDMA) | 49 GB/s (RDMA) | 49 GB/s (RDMA) |
| MI300X + Broadcom | Internode | 16 | 71 GB/s (RDMA) | 81 GB/s (RDMA) | 45 GB/s (RDMA) |
| MI300X + Broadcom | Internode | 32 | 49 GB/s (RDMA) | 55 GB/s (RDMA) | 50 GB/s (RDMA) |

Low-Latency Mode (AMD MI300X + CX7 RoCE)

| #EP | Dispatch Latency | Dispatch BW | Combine Latency | Combine BW |
|---|---|---|---|---|
| 8 | 65 us | 114 GB/s | 92 us | 157 GB/s |
| 16 | 136 us | 55 GB/s | 207 us | 70 GB/s |
| 32 | 224 us | 30 GB/s | 341 us | 42 GB/s |

Porting to AMD GPUs

UCCL-EP’s CPU proxy architecture makes porting to new GPU vendors straightforward — only the GPU kernels need to change, while the CPU-side NIC code remains identical. Porting to AMD GPUs and AWS EFA NICs took only 3 person-months.

The key changes for AMD included:

  • Migrating CUDA PTX intrinsics (atomics, memory fences, timers) to ROCm alternatives
  • Switching WARP_SIZE from 32 to 64 to match AMD wavefronts
  • Replacing NVIDIA TMA-based data copy with AMD CU-based (compute unit) copy
  • Merging “coordinator” wavefronts into “receiver” wavefronts in the HT kernel, since AMD GPUs support fewer wavefronts but more threads per wavefront

After these GPU-side changes, UCCL-EP immediately ran on AMD platforms with any supported NIC (CX7 RoCE, Broadcom Thor-2, Pollara AI NIC) — no per-NIC work was needed.


Application Results

Megatron-LM Training on AMD (MI300X + Broadcom)

We evaluate end-to-end Megatron-LM training of DeepSeek-V3 (downscaled to 32 layers and 379B parameters) on 16 MI300X nodes using the AMD Primus/Megatron-LM framework.


Megatron-LM training throughput (TFLOPS and tokens/s) on 16-node AMD MI300X + Broadcom Thor-2.

Across all configurations, UCCL-EP improves training TFLOPS by 7–36% and token throughput by 7–45% over RCCL. These results show that UCCL-EP provides significant performance benefits for MoE training on AMD hardware.


SGLang Inference on AWS (NVIDIA + EFA)

We evaluate UCCL-EP in SGLang v0.5.3 on a prefill-heavy workload on p5en instances. We compare against NCCL, as DeepEP cannot run on EFA and PPLX had not been integrated into open-source inference engines at the time of evaluation.


SGLang throughput comparison using UCCL-EP vs. NCCL on p5en. Left: DeepSeek R1 (671B). Right: Qwen3 (235B).

DeepSeek R1 (671B):

  • EP=16: UCCL-EP reaches 46K tok/s input throughput, about 5% higher than NCCL.
  • EP=32: UCCL-EP improves to 74K tok/s, a 1.6x speedup over its own EP=16 run.

Qwen3 (235B):

  • EP=32: UCCL-EP reaches 62K tok/s vs. 44K tok/s for NCCL — about 40% higher throughput.

UCCL-EP enables larger EP configurations (EP=32) where NCCL either significantly underperforms or cannot run. CPU utilization increases modestly, from an average of 8% to 22%.


Towards More Efficient Low-Latency Kernels

UCCL-EP’s low-latency (LL) kernel, extending DeepEP, currently issues one 7 KB token per RDMA message. On EFA NICs, this is suboptimal because the current EFA firmware cannot process small tokens at a high enough rate (AWS has confirmed they are working on a firmware fix). PPLX, by contrast, packs tokens into larger messages, giving it an advantage at small batch sizes.

A natural optimization is to pack tokens in a best-effort manner before sending. On the latest UCCL-EP, we have implemented per-expert batching of tokens for low-latency mode. On p5en (EP32, 128 tokens), batching improves BF16 dispatch from 299 us to 268 us (10.4% improvement) and FP8 dispatch from 217 us to 178 us (18.0% improvement). After batching, UCCL-EP outperforms PPLX by 51% on BF16 dispatch and 50% on FP8 dispatch.
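The batching idea can be sketched as follows: a best-effort pass over queued token sends that coalesces contiguous per-expert ranges in the send buffer into one larger RDMA message. This is a simplified model with a hypothetical `batch_by_expert` helper; the actual UCCL-EP batching happens on the kernel side and handles more cases.

```python
from collections import defaultdict

def batch_by_expert(token_cmds):
    """Best-effort batching sketch: group queued token sends
    (expert, offset, length) by destination expert and merge contiguous
    send-buffer ranges into a single larger RDMA message."""
    by_expert = defaultdict(list)
    for expert, offset, length in token_cmds:
        by_expert[expert].append((offset, length))
    batched = []
    for expert, chunks in by_expert.items():
        chunks.sort()
        cur_off, cur_len = chunks[0]
        for off, length in chunks[1:]:
            if off == cur_off + cur_len:   # contiguous: extend current message
                cur_len += length
            else:                          # gap: flush and start a new message
                batched.append((expert, cur_off, cur_len))
                cur_off, cur_len = off, length
        batched.append((expert, cur_off, cur_len))
    return batched
```

Fewer, larger messages are exactly what the current EFA firmware handles well, which is why this recovers most of the gap to PPLX at small batch sizes.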

We evaluate these LL improvements on p5en while comparing against PPLX on both FP8 and BF16 dispatch paths (see the fair-comparison figure below).

p5en comparison with PPLX for BF16 and FP8 dispatch paths.

A Note on Fair Comparison with PPLX Kernels: PPLX uses a different measurement methodology than our original DeepEP-inherited setup. To ensure fairness, we evaluate the results above using a script that aligns with the PPLX measurement methodology. One caveat is that PPLX does not support in-kernel BF16->FP8 conversion. For fairness, the FP8 results we report include BF16->FP8 casting time before entering the PPLX kernel (266.90 us). The PPLX dispatch time excluding external BF16->FP8 pre-cast time is 232.30 us.


Enabling Flow Control

A key advantage of UCCL-EP’s CPU proxy architecture is the ability to implement flow control, which is otherwise difficult with IBGDA: GPU threads typically issue one-sided RDMA operations blindly, with no awareness of NIC queue pressure or completion status.

We observe that the number of outstanding RDMA requests can have a significant impact on various NICs, particularly affecting tail latency. This becomes increasingly critical as the number of destinations increases, where a single straggler can slow down the entire dispatch or combine operation.

We illustrate this on Broadcom Thor-2 (EP32, normal mode) by sweeping communication configurations, such as various NVLink chunk sizes and RDMA chunk sizes. Without flow control, non-optimal configurations catastrophically degrade — bandwidth can collapse from ~49 GB/s to as low as 4.4 GB/s, and notify latency can explode from hundreds of microseconds to over 70 ms. With flow control (e.g., 1.5 MB inflight limit), performance is stable across all configurations.

UCCL-EP’s CPU proxy supports request tracking and pacing:

  • If outstanding requests grow too high, the proxy temporarily buffers messages at the sender to avoid incast at the receiver.
  • The proxy can shard outgoing requests across multiple NICs and QPs to avoid congestion and adapt to NIC-specific characteristics.

This is particularly important for Broadcom Thor-2 and AMD Pollara AI NICs. We provide the environment variables UCCL_IB_MAX_INFLIGHT_BYTES and UCCL_IB_MAX_INFLIGHT_NORMAL to configure flow control. Without it, these NICs can encounter CQE error 12 (Transport Retry Counter Exceeded) under high load.
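The sender-side pacing can be sketched as a simple inflight-bytes limiter, in the spirit of the environment variables above. This is an illustrative model with a hypothetical `InflightLimiter` class, not UCCL-EP's API; the real proxy also tracks per-QP state and shards across NICs.

```python
class InflightLimiter:
    """Sketch of sender-side pacing: hold messages once outstanding bytes
    exceed a cap, and release buffered messages as completions arrive."""
    def __init__(self, max_inflight_bytes=1_500_000):  # 1.5 MB, as in the Thor-2 sweep
        self.cap = max_inflight_bytes
        self.inflight = 0
        self.backlog = []  # messages buffered at the sender to avoid incast
        self.posted = []   # messages actually handed to the NIC

    def post(self, msg_bytes):
        if self.inflight + msg_bytes > self.cap:
            self.backlog.append(msg_bytes)   # over the cap: buffer, don't post
        else:
            self.inflight += msg_bytes
            self.posted.append(msg_bytes)

    def on_completion(self, msg_bytes):
        self.inflight -= msg_bytes
        # Drain the backlog in order while we stay under the cap.
        while self.backlog and self.inflight + self.backlog[0] <= self.cap:
            b = self.backlog.pop(0)
            self.inflight += b
            self.posted.append(b)
```

Capping inflight bytes bounds the retry pressure on the NIC, which is what prevents the Transport Retry Counter Exceeded errors seen without flow control.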


Acknowledgements

This research is supported by gifts from Accenture, AMD, Anyscale, AWS, Broadcom, Cisco, Google, IBM, Intel, Intesa Sanpaolo, Lambda, Lightspeed, Mibura, Microsoft, NVIDIA, Samsung SDS, and SAP.

Published on 2026-03-28