RDMA and RoCE for Ethernet Network Efficiency and Performance
Remote Direct Memory Access (RDMA)
Remote Direct Memory Access (RDMA) provides direct access from the memory of one host (storage or compute) to the memory of another without involving the remote operating system or CPU, boosting network and host performance through lower latency, lower CPU load, and higher bandwidth. In contrast, TCP/IP communication typically requires copy operations, which add latency and consume significant CPU and memory resources.
RDMA over Converged Ethernet (RoCE)
RDMA over Converged Ethernet (RoCE) is a standard protocol, defined by the InfiniBand Trade Association (IBTA), that enables RDMA's efficient data transfer over Ethernet networks, allowing transport offload to a hardware RDMA engine for superior performance. RoCE uses UDP encapsulation, allowing it to cross Layer 3 networks. RDMA is a key capability used natively by the InfiniBand interconnect technology. InfiniBand and Ethernet RoCE share a common user API but have different physical and link layers.
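Because RoCEv2 encapsulates the RDMA transport headers in UDP with the IANA-assigned destination port 4791, its traffic can be observed with standard packet tools. A minimal sketch (the interface name eth0 is a placeholder):

```shell
# Capture RoCEv2 traffic on an Ethernet interface. RoCEv2 carries the
# InfiniBand transport headers inside UDP datagrams with destination
# port 4791 (IANA-assigned). "eth0" is a placeholder interface name.
sudo tcpdump -i eth0 -nn udp dst port 4791
```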
RoCE Fabric Consideration
Mellanox ConnectX-4 and later generations incorporate Resilient RoCE, delivering best-of-breed performance with only a simple enablement of Explicit Congestion Notification (ECN) on the network switches. A lossless fabric, usually achieved by enabling Priority Flow Control (PFC), is no longer required. Resilient RoCE congestion management, implemented in ConnectX NIC hardware, delivers reliability even when running UDP over a lossy network.
Mellanox Spectrum Ethernet switches provide 100GbE line-rate performance and consistent low latency with zero packet loss. With their high performance, low latency, intelligent end-to-end congestion management, and QoS options, Spectrum switches are ideal for implementing a RoCE fabric at scale. Additionally, Spectrum makes RoCE easy to configure and provides end-to-end flow-level visibility.
Implementing Applications over RDMA/RoCE
Application developers have several options for implementing acceleration with RDMA/RoCE using RDMA infrastructure verbs/libraries or middleware libraries:
- RDMA Verbs - The libibverbs library (available inbox in major distributions) provides the API interfaces needed to send and receive data
- RDMA Communication Manager (RDMA-CM) - The RDMA CM library is a communication manager (CM) used to set up reliable, connected, and unreliable datagram data transfers. It works in conjunction with the RDMA verbs API that is defined by the libibverbs library.
- Unified Communication X (UCX) - Open-source production-grade communication framework for data-centric and high-performance applications driven by industry, laboratories, and academia http://www.openucx.org.
- Accelio - A high-performance, asynchronous, reliable messaging and RPC library, open source and community driven
NOTE: Accelio is no longer recommended for new projects. For new projects, please refer to UCX.
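Before writing verbs code, the stack can be exercised with the sample tools shipped alongside libibverbs and librdmacm. A sketch, assuming a ConnectX-class NIC; the device name mlx5_0 and the address 192.0.2.10 are placeholders:

```shell
# List RDMA devices and their capabilities (libibverbs-utils).
ibv_devinfo

# Verbs-level reliable-connection ping-pong test; -g selects the GID
# index, which is required when running over RoCE.
ibv_rc_pingpong -d mlx5_0 -g 0               # on the server
ibv_rc_pingpong -d mlx5_0 -g 0 192.0.2.10    # on the client

# RDMA-CM connection test (librdmacm-utils).
rping -s -v                        # server
rping -c -a 192.0.2.10 -C 10 -v    # client: 10 iterations, verbose
```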
Soft-RoCE is a software implementation of RoCE that allows RoCE to run on any Ethernet network adapter, whether or not it offers hardware acceleration. Soft-RoCE is released as part of upstream kernel 4.8, as well as with Mellanox OFED 4 and above.
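As a sketch, on a recent kernel with the rxe driver, a Soft-RoCE device can be created over an existing Ethernet interface using the iproute2 rdma tool (the interface name eth0 and device name rxe0 are placeholders):

```shell
# Load the Soft-RoCE (rxe) kernel module and attach a software RDMA
# device to an Ethernet interface. "eth0" is a placeholder name.
sudo modprobe rdma_rxe
sudo rdma link add rxe0 type rxe netdev eth0

# Verify the new software RDMA device is visible.
rdma link show
ibv_devices
```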
The Soft-RoCE distribution is available at:
RDMA Benefits
- Zero-copy: Send and receive data to and from remote buffers
- Kernel bypass: Improved latency and throughput
- Low CPU involvement: Access a remote server's memory without consuming CPU cycles on the remote server
- Convergence: Single fabric to support storage and compute
- Close-to-wire-speed performance on lossy fabrics
- Available on InfiniBand and Ethernet (L2 and L3)
Where is RDMA used?
- High Performance Computing (HPC): MPI and SHMEM
- Machine learning: TensorFlow™, Caffe, Microsoft Cognitive Toolkit (CNTK), PaddlePaddle and more
- Big data: Spark, Hadoop
- Databases: Oracle, SAP (HANA)
- Storage: NVMe-oF (remote block access to NVMe SSDs), iSER (iSCSI Extensions for RDMA), Lustre, GPFS, HDFS, Ceph, EMC ScaleIO, VMware Virtual SAN, Dell Fluid Cache, Windows SMB Direct
Related Resources
- White Paper: Enabling Scalable and Super-fast Kubernetes Networking for AI
- White Paper: RoCE in the Data Center
- Competitive Analysis: RoCE vs. iWARP
- White Paper: RoCE vs. iWARP - The Facts You Should Know
- RDMA Aware Programming User Manual
- Running RoCE Over L2 Network Enabled with PFC
- Soft-RoCE README
- Improve Data Transfer Efficiency
Hardware support for RDMA and RoCE
Software driver support for RDMA and RoCE
RDMA and RoCE are supported on major operating systems from these versions: