Application Level Protocols

RDMA Fundamentals

  • RDMA has traditionally been used in high-performance computing (HPC) and storage networking environments. RDMA operations are useful in massively parallel compute clusters, such as artificial intelligence and machine learning (AI/ML) clusters, where the graphics processing units (GPUs) in the remote nodes can directly access each other’s memory (GPU Direct RDMA). RDMA traditionally requires a dedicated, purpose-built InfiniBand network with its own network adapters and switches. However, implementations of RDMA over Ethernet and IP transport networks also exist.

  • In a traditional network data path, the operating system fully controls the hardware, and the application uses operating system services such as system calls to access it. Specifically, the user-space application process uses memory buffers through sockets application programming interface (API) calls. In the kernel, the data path includes the TCP/IP stack down to the network adapter device driver and, eventually, the network fabric. This software-driven data movement consumes many CPU cycles, which significantly increases latency and limits the achievable throughput.

  • With RDMA, the kernel is bypassed, which avoids the expensive CPU overhead associated with context switches and data copying between user and kernel space (zero-copy operation). RDMA allows applications to perform direct memory-to-memory data transfers between remote nodes over the network without burdening the CPU. The data transfer function is offloaded to a specialized network adapter, which allows regular CPU operations to continue in parallel. Hence, the application directly accesses the buffers on the RDMA-aware network interface card (NIC) and exchanges data with buffers on the remote host without involving the operating system on either side (a short memory-registration sketch follows this list).

  • Note: In this training, the focus is on AI/ML applications running in the user space and exchanging data directly over the network. Other common RDMA applications are kernel-space processes, such as a file system exchanging data with a remote storage system.

  • Data transfer offloading enables high throughput, low-latency transfer of information between compute nodes at the memory-to-memory level. RDMA has a side benefit of reducing the power requirements, which lowers the total cost of ownership.
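
  • The zero-copy behavior described above starts with memory registration: the application pins a user-space buffer and hands it to the RDMA NIC, which returns the local and remote keys that later operations reference. The following minimal C sketch uses the widely available libibverbs API; the device selection, buffer size, and terse error handling are simplified assumptions, not a complete program template.

```c
/* Minimal sketch: register a user-space buffer with an RDMA NIC using
 * libibverbs so the NIC can read and write it directly (zero copy).
 * First device, 1 MiB buffer, and terse error handling are assumptions. */
#include <stdio.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

int main(void) {
    struct ibv_device **devs = ibv_get_device_list(NULL);
    if (!devs || !devs[0]) { fprintf(stderr, "no RDMA device found\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);   /* open the NIC      */
    struct ibv_pd *pd = ibv_alloc_pd(ctx);                /* protection domain */

    size_t len = 1 << 20;                                 /* 1 MiB app buffer  */
    void *buf = malloc(len);

    /* Pin the buffer and expose it to the NIC. The returned memory region
     * carries the lkey/rkey that later work requests will reference. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) { perror("ibv_reg_mr"); return 1; }

    printf("registered %zu bytes, lkey=0x%x rkey=0x%x\n", len, mr->lkey, mr->rkey);

    ibv_dereg_mr(mr);
    free(buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```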

RDMA Architecture

  • RDMA has one simple goal—to allow applications to communicate directly over the network. This goal is achieved by applications using the RDMA message service through an API. The result is that a virtual channel is established internally between the two remote applications—between two disjoint application memory address spaces. Each end of the virtual channel has a work queue pair, which is mapped to application memory space. A work queue pair consists of two types of communication queues: a send queue and a receive queue. Also, there is a completion queue that tracks the completion of work requests. These queues allow applications to exchange messages directly by submitting work requests, which are effectively RDMA operations.

  • Note: Each application can set up multiple connections, and each connection will have its own dedicated work queue pair.

  • To enable this communication mechanism, the RDMA architecture stack has the following components, which are illustrated in the following figure (a short code sketch of the corresponding setup calls follows the figure):

    • API: The API allows an application to use RDMA through the RDMA message service. The RDMA API implements the appropriate methods and operations that applications use to access RDMA services. Methods specify the required behaviors of the APIs, as defined by the InfiniBand standard. This approach ensures compatibility across different operating systems and hardware vendors.

    • RDMA message service: The messaging service provides access to the RDMA hardware via work queue pairs and RDMA operations that are performed on top of queues. Instead of using operating system services, applications use the RDMA message service to exchange data directly. In effect, the message service abstracts the RDMA inner workings and hides the complexities of RDMA from an application.

    • RDMA work queue pairs: Each queue pair consists of a send queue (for sending data to a remote node) and a receive queue (for receiving data from a remote node). Depending on the type of operation needed, the application posts the appropriate work request to a send or receive queue. This action triggers the RDMA message service to transport the message on the channel. Once the processing is finished, an entry is placed in the completion queue, which notifies the application that the operation has been completed.

    • RDMA work requests (operations): An RDMA operation is a method of transferring a message. Two different semantics are supported—memory and channel semantics—each with its own set of supported operations.

    • RDMA-aware network adapter: The InfiniBand or Ethernet network adapter implements RDMA. The network adapter performs the requested operations by transmitting the message to a remote host. The network adapter also performs other control tasks that are necessary for establishing and maintaining a physical communication channel.

    • Network interconnect: This network comprises switches and cabling (InfiniBand or Ethernet).

    [Figure: RDMA architecture stack]
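
  • As a rough illustration of how these components fit together, the following minimal libibverbs sketch creates a completion queue and a reliable connected work queue pair. The device context and protection domain are assumed to come from a setup like the earlier registration sketch, and the queue depths are arbitrary example values.

```c
/* Sketch: create a completion queue and a work queue pair (send queue plus
 * receive queue) with libibverbs. 'ctx' and 'pd' are assumed to exist already;
 * queue depths and SGE limits are illustrative values only. */
#include <infiniband/verbs.h>

struct ibv_qp *create_queue_pair(struct ibv_context *ctx, struct ibv_pd *pd,
                                 struct ibv_cq **cq_out)
{
    /* One completion queue tracks finished work requests for both queues. */
    struct ibv_cq *cq = ibv_create_cq(ctx, 256 /* depth */, NULL, NULL, 0);
    if (!cq) return NULL;

    struct ibv_qp_init_attr attr = {
        .send_cq = cq,                /* completions for send-queue work     */
        .recv_cq = cq,                /* completions for receive-queue work  */
        .cap = {
            .max_send_wr  = 128,      /* outstanding send work requests      */
            .max_recv_wr  = 128,      /* outstanding receive work requests   */
            .max_send_sge = 4,        /* scatter-gather entries per request  */
            .max_recv_sge = 4,
        },
        .qp_type = IBV_QPT_RC,        /* reliable connected transport        */
    };

    struct ibv_qp *qp = ibv_create_qp(pd, &attr);   /* the work queue pair */
    if (!qp) { ibv_destroy_cq(cq); return NULL; }

    *cq_out = cq;
    return qp;
}
```

  • Note: Before the queue pair can carry traffic, it must still be transitioned through its initialization, ready-to-receive, and ready-to-send states and exchange addressing information with its peer; connection managers such as librdmacm automate these steps.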

RDMA Operations

  • An RDMA application speaks to the RDMA-aware network adapter directly using the RDMA verbs-based API. Unlike the byte-based stream processing that is used by TCP, RDMA uses message-based transactions to transfer data across different hosts. Each message has a variable size and is a basic unit of work in an RDMA domain.

  • The RDMA API has two types of verbs, which can be thought of as communication styles, as illustrated in the following figure.

    • Message Verbs: Message verbs are also known as channel semantics. The supported operations are RDMA Send and RDMA Receive. The destination host first prepares the data structure where the message will be stored on its receive queue. The source host then issues a send operation, which transmits a message to the destination side. Message verbs operations therefore require the participation of the remote host's CPU, because the sending side has no visibility of the remote buffer.

    • Memory Verbs: Memory verbs are also known as memory semantics. In this communication style, the destination host registers a buffer in its memory space and passes control of that buffer to the source host. The source host can then issue RDMA Read and RDMA Write operations, which allow reading from and writing to remote memory, respectively. RDMA Atomic operations are also supported; these enable synchronization in distributed systems by ensuring atomicity through operations such as compare-and-swap and fetch-and-add performed on remote memory. Some vendors offer additional proprietary enhancements as well. Memory verbs operations do not require any confirmation from the remote machine, because the sending side has visibility and control of the remote buffer (see the code sketch after the comparison table below).

    [Figure: message verbs compared with memory verbs]

| Characteristic | Channel Semantics (Message Verbs) | Memory Semantics (Memory Verbs) |
| --- | --- | --- |
| Reliability | Both reliable and unreliable transport services are supported. | Reliable transport only. |
| Synchronous or asynchronous mode | Synchronous | Asynchronous |
| Use case | Transfer of short control messages | Bulk data transfers |
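
  • To make the two styles concrete, the following libibverbs sketch posts one work request of each kind: a receive and a send (channel semantics), and an RDMA write into a remote buffer (memory semantics). The queue pair, the registered memory region, and the peer's remote address and rkey are assumed to have been set up and exchanged out of band; this is an illustration, not a complete data path.

```c
/* Sketch: channel vs. memory semantics expressed as libibverbs work requests.
 * 'qp' and 'mr' are assumed to exist; 'remote_addr' and 'rkey' would normally
 * be learned from the peer out of band. One local buffer is reused for
 * brevity. */
#include <stdint.h>
#include <infiniband/verbs.h>

int post_examples(struct ibv_qp *qp, struct ibv_mr *mr,
                  uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)mr->addr,   /* local registered buffer          */
        .length = 4096,                  /* assumes the region is big enough */
        .lkey   = mr->lkey,
    };
    struct ibv_recv_wr recv_wr = { .wr_id = 1, .sg_list = &sge, .num_sge = 1 };
    struct ibv_send_wr send_wr = { .wr_id = 2, .sg_list = &sge, .num_sge = 1,
                                   .opcode = IBV_WR_SEND,
                                   .send_flags = IBV_SEND_SIGNALED };
    struct ibv_send_wr write_wr = { .wr_id = 3, .sg_list = &sge, .num_sge = 1,
                                    .opcode = IBV_WR_RDMA_WRITE,
                                    .send_flags = IBV_SEND_SIGNALED };
    write_wr.wr.rdma.remote_addr = remote_addr;   /* peer's registered buffer   */
    write_wr.wr.rdma.rkey        = rkey;          /* key granting remote access */

    struct ibv_recv_wr *bad_rwr;
    struct ibv_send_wr *bad_swr;

    /* Channel semantics: the receiver pre-posts a buffer, the sender sends. */
    if (ibv_post_recv(qp, &recv_wr, &bad_rwr)) return -1;
    if (ibv_post_send(qp, &send_wr, &bad_swr)) return -1;

    /* Memory semantics: write directly into the peer's memory; the remote
     * CPU is not involved in the transfer. */
    if (ibv_post_send(qp, &write_wr, &bad_swr)) return -1;

    /* Completion handling: poll the completion queue for the signaled work. */
    struct ibv_wc wc;
    while (ibv_poll_cq(qp->send_cq, 1, &wc) == 0) { /* busy-wait for the demo */ }
    return wc.status == IBV_WC_SUCCESS ? 0 : -1;
}
```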

  • Note: RDMA reliable transfer characteristics include correct order of delivery, error-free transmission, and data integrity.

  • The two communication styles complement each other and are used jointly for different use cases. For AI/ML workloads, memory-semantic RDMA read and write operations are used to enable low-latency and high-throughput data transfers between remote GPUs in the AI/ML cluster. The key capabilities and benefits that make the RDMA message-based operations model highly efficient for AI/ML and HPC workloads include the following:

    • Support for thousands to millions of work queue pairs.

    • Memory registration of user space buffers with the network adapter.

    • The ability to associate a registered buffer with multiple queue pairs, so that work queue entries on any of those queue pairs can reference any registered buffer.

    • Scatter-gather entry support; RDMA works natively with multiple scatter-gather entries, such as reading from multiple memory buffers and sending them as one block, or receiving one block and writing it to multiple memory buffers (a short scatter-gather sketch follows the note below).

    • Additional operations for InfiniBand, like Atomic Fetch & Add and Compare & Swap, and a mechanism to signal and retrieve the completion results of previously submitted work queue operations.

  • In summary, all the mentioned RDMA model characteristics and the efficiency of RDMA operations allow AI/ML applications to scale and take advantage of high-bandwidth AI/ML cluster resources in a shared, multitenant manner.

  • Note: Scatter-gather operations efficiently handle discontiguous data that is distributed across different memory regions. This capability is very useful for transferring complex data structures or large data sets.
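
  • As a brief illustration of the note above, the following sketch attaches two scatter-gather entries to a single send work request, so two discontiguous, pre-registered buffers are transmitted as one RDMA message. The buffer sizes and memory regions are placeholder assumptions.

```c
/* Sketch: one send work request gathering two discontiguous, pre-registered
 * buffers into a single RDMA message. Lengths are illustrative only. */
#include <stdint.h>
#include <infiniband/verbs.h>

int send_gathered(struct ibv_qp *qp, struct ibv_mr *hdr_mr, struct ibv_mr *body_mr)
{
    struct ibv_sge sges[2] = {
        { .addr = (uintptr_t)hdr_mr->addr,  .length = 64,   .lkey = hdr_mr->lkey  },
        { .addr = (uintptr_t)body_mr->addr, .length = 8192, .lkey = body_mr->lkey },
    };
    struct ibv_send_wr wr = {
        .wr_id      = 42,
        .sg_list    = sges,
        .num_sge    = 2,                  /* gather both buffers into one message */
        .opcode     = IBV_WR_SEND,
        .send_flags = IBV_SEND_SIGNALED,
    };
    struct ibv_send_wr *bad;
    return ibv_post_send(qp, &wr, &bad);  /* the NIC reads both regions directly */
}
```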

RDMA over Converged Ethernet (RoCE)

  • In its first implementation, the InfiniBand Trade Association (IBTA) brought the full benefits of RDMA to the market using the InfiniBand network to provide high throughput, CPU bypass, and lower latency. InfiniBand also built congestion management into the protocol. Although these benefits made InfiniBand the HPC transport of choice, it required a custom and dedicated InfiniBand network. These purpose-built networks brought additional cost and complexity to the enterprise.

  • As an alternative to InfiniBand, the IBTA specification describes using an Ethernet network for transport, which is known as RDMA over Converged Ethernet (RoCE). This effectively layers RDMA protocols and messaging service on top of Ethernet transport, which allows you to have the full benefits of RDMA architecture in the data centers where Ethernet is already present.

  • Ethernet is lossy by design, which is not ideal for running AI/ML and HPC workloads because lost packets cause performance degradation. By taking advantage of the latest Ethernet innovations, such as data center bridging (DCB) extensions that include priority flow control (PFC) and explicit congestion notification (ECN), you can have the same delivery guarantees that native InfiniBand does.

  • These innovations effectively provide you with the same RDMA benefits, such as RDMA data transfers, efficient kernel stack bypass, and other features that are not possible with traditional TCP over Ethernet. RoCE delivers most of the benefits of the InfiniBand architecture but does so over Ethernet without requiring a significant upgrade.

  • RoCE version 1 (RoCEv1) is an IBTA standard that was introduced in 2010 and works only within a single Layer 2 broadcast domain. The RoCEv1 protocol is an Ethernet link layer protocol with EtherType 0x8915. RoCE version 2 (RoCEv2) was introduced in 2014 and allows traffic to be routed over IP fabrics. RoCEv2 encapsulates the InfiniBand transport in Ethernet, IP, and UDP headers so that it can be routed over Ethernet networks. The RoCEv2 protocol runs on top of UDP over IPv4 or IPv6, and UDP destination port 4791 has been reserved for RoCEv2. Hence, the internal packet structure of the two protocols is different, as illustrated in the following figure (a rough C sketch of the header layering follows the comparison table below).

    [Figure: RoCEv1 and RoCEv2 packet formats]

  • So, although both RoCEv1 and RoCEv2 provide the same benefits from an application perspective, RoCEv2 allows for much more flexible network designs. RoCEv2 is considered a standard in today’s data centers, so whenever RoCE is mentioned, it usually refers to RoCEv2. RoCEv1 and RoCEv2 are compared in the following table.

| RoCEv1 | RoCEv2 |
| --- | --- |
| Operates over Layer 2 (Ethernet) and requires a lossless Ethernet fabric. | Operates over Layer 3 (IP), enabling it to work across routed networks. |
| Achieves lossless transmission using DCB techniques. | Uses UDP encapsulation, allowing RoCEv2 traffic to traverse IP networks. |
| Typically limited to a single Ethernet broadcast domain due to its Layer 2 nature. | More scalable than RoCEv1 due to its ability to use IP routing. |
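
  • As a rough illustration of the encapsulation difference, the following C sketch lays out the InfiniBand Base Transport Header (BTH) that both RoCE versions carry, with comments describing how each version frames it on the wire. This is an illustrative layout only, not a parser, and it ignores the ICRC and optional extended transport headers.

```c
/* Rough sketch of RoCE framing. Field widths follow the InfiniBand Base
 * Transport Header (BTH); the struct is for illustration only. */
#include <stdint.h>

#define ROCEV1_ETHERTYPE 0x8915   /* RoCEv1 rides directly on Ethernet        */
#define ROCEV2_UDP_DPORT 4791     /* UDP destination port reserved for RoCEv2 */

/* InfiniBand Base Transport Header: 12 bytes, carried inside UDP for RoCEv2. */
struct ib_bth {
    uint8_t  opcode;          /* operation, e.g. SEND, RDMA WRITE, ACK           */
    uint8_t  se_m_pad_tver;   /* solicited event, migration, pad count, version  */
    uint16_t pkey;            /* partition key                                   */
    uint32_t dest_qp;         /* 8 reserved bits + 24-bit destination queue pair */
    uint32_t ack_psn;         /* ack-request bit + 24-bit packet sequence number */
} __attribute__((packed));

/* RoCEv1 frame: Ethernet (EtherType 0x8915) | GRH | BTH | payload | ICRC
 * RoCEv2 frame: Ethernet | IPv4/IPv6 | UDP (dport 4791) | BTH | payload | ICRC */
```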

  • RoCEv1 is supported on older Cisco UCS Virtual Interface Card (VIC) 1300 Series adapters. RoCEv2 is supported on the newer Cisco VIC 1400 and 15000 Series adapters. RoCEv2 is not backward compatible with RoCEv1.

  • Note: For RoCE to work properly, it is crucial to configure the NICs and operating system correctly.
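
  • One practical host-side check is which GID table entries on the adapter are RoCEv2-capable, because the GID type selected for a queue pair determines whether RoCEv1 or RoCEv2 framing is used. The sketch below lists the GIDs of port 1 on the first RDMA device and reads each entry's type from sysfs; the device index, port number, and sysfs path follow common Linux rdma-core conventions and are assumptions for this example.

```c
/* Sketch: enumerate GID table entries on port 1 and print each entry's type
 * (for example "IB/RoCE v1" or "RoCE v2") as exposed by Linux sysfs, so a
 * RoCEv2-capable GID index can be chosen. First device and port 1 assumed. */
#include <stdio.h>
#include <string.h>
#include <infiniband/verbs.h>

int main(void) {
    struct ibv_device **devs = ibv_get_device_list(NULL);
    if (!devs || !devs[0]) return 1;

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_port_attr pattr;
    if (ibv_query_port(ctx, 1, &pattr)) return 1;

    for (int i = 0; i < pattr.gid_tbl_len; i++) {
        union ibv_gid gid;
        if (ibv_query_gid(ctx, 1, i, &gid)) continue;     /* skip empty entries */

        char path[256], type[32] = "?";
        snprintf(path, sizeof(path),
                 "/sys/class/infiniband/%s/ports/1/gid_attrs/types/%d",
                 ibv_get_device_name(devs[0]), i);
        FILE *f = fopen(path, "r");
        if (f) {
            if (fgets(type, sizeof(type), f)) type[strcspn(type, "\n")] = '\0';
            fclose(f);
        }
        printf("gid index %d: type %s\n", i, type);
    }

    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```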

RoCE Workflow

  • First, the Ethernet fabric needs to be set up and configured properly, including a dedicated VLAN, quality of service (QoS) and DCB features on the switches, and installation and tuning of the NIC driver and libraries in the operating system. Only after the fabric is fully operational can RoCE workflows be established. The RoCE workflow setup is transparent because it is performed automatically, but it involves many steps, such as registering memory, establishing connections, handling completions, and eventually tearing down the connections.

  • A typical RoCE communication flow has several stages (a minimal client-side connection sketch follows this list):

    • Initialization: RDMA NICs initialize and register memory regions. Queue pairs are created and configured.

    • Connection Establishment: RoCEv2 establishes connections using IP addresses and UDP ports, whereas RoCEv1 uses Ethernet MAC addresses.

    • Data Transfer: Data is transferred directly between registered memory regions using RDMA read, write, or send operations, bypassing the CPU.

    • Completion Handling: After operations are completed, completion queues are updated to signal the completion status to the application.
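
  • These stages are typically driven through the librdmacm connection manager, which resolves the peer by IP address (RoCEv2), builds the queue pair, and handles the state transitions. The following client-side sketch is a minimal illustration compiled against librdmacm and libibverbs; the server address, port, and message size are placeholder assumptions, and error handling is compressed.

```c
/* Minimal client-side sketch of the RoCE workflow with librdmacm: resolve the
 * peer by IP, create the endpoint (queue pair), register a message buffer,
 * connect, exchange one message, and wait for completions.
 * "192.0.2.10" and port "7471" are placeholders; buffer reused for brevity. */
#include <rdma/rdma_cma.h>
#include <rdma/rdma_verbs.h>

int main(void) {
    struct rdma_addrinfo hints = { .ai_port_space = RDMA_PS_TCP }, *res;
    if (rdma_getaddrinfo("192.0.2.10", "7471", &hints, &res)) return 1;

    /* Initialization: creating the endpoint builds and configures the QP. */
    struct ibv_qp_init_attr attr = { 0 };
    attr.cap.max_send_wr = attr.cap.max_recv_wr = 4;
    attr.cap.max_send_sge = attr.cap.max_recv_sge = 1;
    attr.sq_sig_all = 1;                      /* one completion per send */
    struct rdma_cm_id *id;
    if (rdma_create_ep(&id, res, NULL, &attr)) return 1;

    /* Memory registration for a small message buffer. */
    char msg[64] = "hello over RoCE";
    struct ibv_mr *mr = rdma_reg_msgs(id, msg, sizeof(msg));
    if (!mr) return 1;

    /* Pre-post a receive, then establish the connection over IP/UDP. */
    rdma_post_recv(id, NULL, msg, sizeof(msg), mr);
    if (rdma_connect(id, NULL)) return 1;

    /* Data transfer and completion handling. */
    rdma_post_send(id, NULL, msg, sizeof(msg), mr, 0);
    struct ibv_wc wc;
    while (rdma_get_send_comp(id, &wc) == 0) ;   /* wait for the send  */
    while (rdma_get_recv_comp(id, &wc) == 0) ;   /* wait for the reply */

    /* Teardown. A matching server would typically use rdma_listen and
     * rdma_get_request to accept this connection. */
    rdma_disconnect(id);
    rdma_dereg_mr(mr);
    rdma_destroy_ep(id);
    rdma_freeaddrinfo(res);
    return 0;
}
```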
