Highlights:

  • Perplexity AI introduces TransferEngine, a portable RDMA communication layer for LLM systems.
  • Achieves peak throughput of 400 Gbps on NVIDIA ConnectX-7 and AWS Elastic Fabric Adapter (EFA).
  • Enables flexible point-to-point operations for Mixture-of-Experts (MoE) routing, reinforcement learning updates, and disaggregated inference.
  • Eliminates hardware lock-in and improves scaling across network interface controllers (NICs).

TLDR:

A new research paper from Perplexity AI presents TransferEngine, a hardware-agnostic RDMA point-to-point communication framework that significantly boosts performance and scalability in large language model systems while maintaining portability across network architectures.

Researchers Nandor Licker, Kevin Hu, Vladimir Zaytsev, and Lequn Chen of Perplexity AI have unveiled a new approach to accelerating large-scale language models: a portable Remote Direct Memory Access (RDMA) communication layer named TransferEngine. Their paper, titled ‘RDMA Point-to-Point Communication for LLM Systems’ (arXiv:2510.27656), introduces a design aimed at overcoming long-standing bottlenecks in distributed LLM infrastructure.

Traditional distributed inference systems and reinforcement learning pipelines depend heavily on collective communication primitives such as AllReduce or Broadcast. However, emerging workloads, including Mixture-of-Experts (MoE) routing, disaggregated inference, and reinforcement learning fine-tuning, frequently require dynamic, flexible point-to-point data transfers. These patterns strain existing communication stacks, which are often tightly coupled to specific network interface cards (NICs), limiting portability and scaling across heterogeneous environments. TransferEngine addresses this directly by presenting a unified communication interface that works across multiple NIC platforms.
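
To see why point-to-point transfers matter here, consider MoE dispatch: the router decides at runtime which experts, and therefore which ranks, each token must reach. The minimal Python sketch below illustrates the shape of that workload (it is an illustration only, not TransferEngine code; the rank counts and router stand-in are assumptions): every rank ends up sending a different, data-dependent batch to every other rank on every step, a pattern that static collectives do not express naturally.

```python
# Minimal sketch (not TransferEngine's API): why MoE dispatch is a
# point-to-point problem. The router's per-token expert choices make
# the communication pattern data-dependent and different on every step.

import random

NUM_RANKS = 4          # hypothetical number of expert-parallel ranks
EXPERTS_PER_RANK = 2
TOP_K = 2              # experts selected per token

def build_send_lists(token_ids):
    """Group tokens by the rank hosting each selected expert."""
    send_lists = {rank: [] for rank in range(NUM_RANKS)}
    for tok in token_ids:
        # Stand-in for the router: pick TOP_K distinct experts at random.
        experts = random.sample(range(NUM_RANKS * EXPERTS_PER_RANK), TOP_K)
        for e in experts:
            dest_rank = e // EXPERTS_PER_RANK
            send_lists[dest_rank].append((tok, e))
    return send_lists

if __name__ == "__main__":
    lists = build_send_lists(token_ids=range(16))
    for rank, items in lists.items():
        # Each destination receives a different, data-dependent batch:
        # exactly the workload shape that point-to-point RDMA writes serve.
        print(f"rank {rank}: {len(items)} (token, expert) pairs")
```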

The TransferEngine framework introduces a new primitive, WriteImm, and an accompanying ImmCounter notification mechanism that signals completion without relying on ordering guarantees at the transport layer. It also transparently manages multiple NICs per GPU, enabling seamless scaling and full bandwidth utilization. Experimental results show peak throughput of 400 Gbps on both NVIDIA’s ConnectX-7 hardware and Amazon’s Elastic Fabric Adapter (EFA), a significant milestone for multi-GPU and multi-node training.

The paper highlights practical deployments of TransferEngine in three production settings: high-speed KvCache transfer for disaggregated inference, reinforcement learning weight updates completing in just 1.3 seconds on trillion-parameter models, and MoE dispatch/combine operations that beat DeepEP latency baselines. These results position TransferEngine as a critical step toward fully hardware-agnostic, high-performance LLM communication infrastructure.
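
The completion mechanism can be modeled in a few lines. In the sketch below (assumed semantics based on the paper’s description, not its actual code), the receiver simply counts immediate values per transfer and declares completion once the expected count is reached, so individual writes may land in any order. This matters on transports like EFA’s Scalable Reliable Datagram, which does not guarantee in-order delivery.

```python
# Minimal model (assumed semantics, not the paper's code) of the
# WriteImm / ImmCounter idea: the sender splits a transfer into many
# one-sided writes, each carrying an immediate value tagging the
# transfer; the receiver counts immediates instead of assuming the
# transport delivers writes in order.

import random
from collections import defaultdict

class ImmCounter:
    """Completion tracking by counting arrivals, not by ordering them."""
    def __init__(self):
        self.received = defaultdict(int)   # transfer_id -> writes seen
        self.expected = {}                 # transfer_id -> writes posted

    def expect(self, transfer_id, num_writes):
        self.expected[transfer_id] = num_writes

    def on_write_imm(self, transfer_id):
        # Called once per arriving write, in whatever order writes land.
        self.received[transfer_id] += 1

    def is_complete(self, transfer_id):
        return self.received[transfer_id] == self.expected[transfer_id]

if __name__ == "__main__":
    counter = ImmCounter()
    counter.expect(transfer_id=7, num_writes=8)

    chunks = list(range(8))
    random.shuffle(chunks)             # simulate out-of-order delivery
    for _chunk in chunks:
        counter.on_write_imm(transfer_id=7)

    assert counter.is_complete(7)      # completion holds regardless of order
    print("transfer 7 complete")
```

Counting rather than ordering is what allows one interface to span NICs with very different delivery guarantees, which is the portability claim at the heart of the paper.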

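Reaching 400 Gbps on hosts that expose several adapters per GPU also requires aggregating links, which is where the multi-NIC management mentioned above comes in. The following hypothetical sketch stripes one logical transfer across the available NICs; the round-robin chunk assignment and chunk size here are illustrative assumptions, not TransferEngine’s actual scheduling policy.

```python
# Hypothetical sketch of the multi-NIC idea: stripe one logical
# transfer across all NICs attached to a GPU so effective bandwidth
# approaches the sum of the links. The policy below (fixed-size
# round-robin chunks) is an assumption for illustration only.

CHUNK_BYTES = 1 << 20  # 1 MiB stripes (arbitrary choice)

def stripe_across_nics(buffer_len, num_nics):
    """Assign each chunk of a buffer to a NIC, round-robin."""
    plan = []
    offset = 0
    while offset < buffer_len:
        size = min(CHUNK_BYTES, buffer_len - offset)
        nic = (offset // CHUNK_BYTES) % num_nics
        plan.append((nic, offset, size))   # (nic index, byte offset, bytes)
        offset += size
    return plan

if __name__ == "__main__":
    plan = stripe_across_nics(buffer_len=10 * (1 << 20), num_nics=4)
    for nic, off, size in plan:
        print(f"NIC {nic}: write {size} B at offset {off}")
```
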
Beyond its performance metrics, the importance of TransferEngine lies in its contribution to open, vendor-agnostic AI systems. By abstracting away the hardware-specific differences of NIC implementations, Perplexity AI’s team enables researchers and developers to deploy distributed AI workloads without being constrained by proprietary network technologies. With LLM architectures continuing to scale past the trillion-parameter mark, this portable, high-bandwidth approach to RDMA communication promises to improve the efficiency and accessibility of next-generation AI compute infrastructure.

Source:

Original research: ‘RDMA Point-to-Point Communication for LLM Systems’ by Nandor Licker, Kevin Hu, Vladimir Zaytsev, and Lequn Chen (Perplexity AI), available at https://arxiv.org/abs/2510.27656
