InfiniBand to Ethernet Transition

  • Enterprises are moving toward Ethernet as the medium of choice due to its affordability and the opportunity to create a converged fabric that carries regular enterprise traffic alongside RDMA workloads. Ethernet offers the following advantages:

    • The ability to protect your Ethernet investments and run legacy socket-based applications by using RoCE.

    • The ability to gain additional performance with native RDMA applications by using verbs.

  • Another benefit of Ethernet is that network administrators are already very familiar with it. However, Ethernet has some drawbacks, such as its lossy nature, susceptibility to congestion, and limited visibility, which can harm workloads like high-performance computing (HPC) and AI/ML.

  • Note: Traditional applications that are written in any high-level programming language use operating system services to access the network through a communication point that is known as a socket. Native RDMA applications use the RDMA verbs-based application programming interface (API) to bypass the operating system kernel and access the network adapter directly.
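
The kernel bypass described in the note can be made concrete with a minimal C sketch using libibverbs, the standard verbs library: the application opens the RDMA device and registers a buffer so the adapter can read and write it directly. Error handling is abbreviated, and a complete application would also create completion queues and queue pairs before posting work.

```c
/* Minimal sketch of verbs data-path setup with libibverbs.
 * Compile with: gcc rdma_open.c -libverbs */
#include <stdio.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num_devices = 0;
    struct ibv_device **devs = ibv_get_device_list(&num_devices);
    if (!devs || num_devices == 0) {
        fprintf(stderr, "no RDMA-capable devices found\n");
        return 1;
    }

    /* Open the first adapter; subsequent data-path operations on it
     * go straight to the hardware, bypassing the kernel. */
    struct ibv_context *ctx = ibv_open_device(devs[0]);
    if (!ctx) {
        fprintf(stderr, "failed to open device\n");
        return 1;
    }
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    /* Register a buffer so the adapter can DMA to and from it. */
    char *buf = malloc(4096);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, 4096,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE);
    printf("registered memory region: lkey=0x%x rkey=0x%x\n",
           mr->lkey, mr->rkey);

    /* A real application would now create completion queues and queue
     * pairs, exchange connection details, and post work requests. */
    ibv_dereg_mr(mr);
    free(buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```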

  • Significant motivations for the InfiniBand-to-Ethernet transition are as follows:

    • Cost efficiency: Ethernet hardware tends to be more cost-effective due to its widespread adoption and economies of scale.

    • Ecosystem and compatibility: Ethernet has a broad ecosystem with extensive vendor support, making integration with existing IT infrastructure easier.

    • Scalability: Ethernet networks are more scalable and can easily be extended across larger distances and more complex topologies that use standard IP routing.

    • Simplified management: Ethernet is often easier to manage and troubleshoot, with a larger pool of IT professionals familiar with its protocols and tools.

Technical Comparisons of Ethernet and InfiniBand

  • Both Ethernet and InfiniBand are mature technologies. Ethernet offers greater flexibility, versatility, and lower costs and is more widespread, while InfiniBand is purpose-built for, and well established in, traditional HPC environments. Ethernet has surpassed InfiniBand in terms of network bandwidth; however, InfiniBand still has lower end-to-end latency than Ethernet. This situation could change in the future due to fast-evolving networking hardware and standards such as those from the Ultra Ethernet Consortium (UEC).

  • Traditionally, configuring and tuning advanced Ethernet parameters such as buffer size, priority flow control (PFC), and explicit congestion notification (ECN) on switches can be complex. To address this, Ethernet vendors provide solutions such as the Cisco Nexus Dashboard Fabric Controller management platform.
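
Switch-side PFC and ECN policy lives on the fabric, but the traffic classes it acts on are identified by DSCP and ECN markings applied at the host. The following minimal Linux C sketch shows that host-side marking; the DSCP value 26 is purely illustrative, since the class actually used for RDMA traffic depends on the fabric's QoS policy.

```c
/* Minimal sketch: marking a socket's traffic with a DSCP value and the
 * ECT(0) ECN codepoint on Linux. The DSCP value is an assumption for
 * illustration, not a recommendation. */
#include <stdio.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) {
        perror("socket");
        return 1;
    }

    /* TOS byte layout: DSCP in the top six bits, ECN in the low two.
     * ECT(0) = 0b10 tells ECN-capable switches they may mark packets
     * instead of dropping them when queues build up. */
    int tos = (26 << 2) | 0x02;
    if (setsockopt(fd, IPPROTO_IP, IP_TOS, &tos, sizeof(tos)) < 0) {
        perror("setsockopt(IP_TOS)");
        return 1;
    }
    printf("socket marked with TOS 0x%02x\n", tos);
    return 0;
}
```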

  • InfiniBand has evolved to support various features that enhance traffic forwarding performance, reduce fault recovery time, increase scalability, and lower operational complexity. On the Ethernet side, all these areas are being addressed in the upcoming Ethernet standards.

  • The following table compares key technical aspects of InfiniBand and Ethernet.

| Feature | InfiniBand | Ethernet (RoCEv2) |
| --- | --- | --- |
| Flow control mechanism | Credit-based | PFC and ECN |
| Forwarding mode | Forwarding by local ID | IP-based forwarding |
| Load-balancing mode | Packet-by-packet adaptive routing | ECMP (standard Ethernet) |
| Recovery | Network self-healing enhancements | Routing reconvergence |

Ultra Ethernet Consortium

  • UEC is an industry initiative focused on developing an open, next-generation, Ethernet-based full-stack architecture designed for high-performance networking, such as running AI and HPC workloads at scale. AI models and HPC workloads need ever-larger clusters; therefore, network performance and total cost of ownership (TCO) are becoming limiting factors. The UEC aims to retain the advantages of Ethernet and IP while delivering the performance that AI and HPC applications require, by providing better Ethernet transmission than existing RDMA transport fabrics.

  • Existing workloads should be able to migrate to UEC without any changes, because existing AI frameworks and HPC library APIs are preserved. The UEC stack integrates with existing frameworks and uses Libfabric as a northbound API.

  • Note: Libfabric, also known as Open Fabrics Interfaces (OFI), is a low-level communication library (API) used for high-performance parallel and distributed applications.
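
As a hedged illustration of what "Libfabric as a northbound API" looks like in practice, the following C sketch uses fi_getinfo() to enumerate the available providers and request the reliable-datagram, message-and-RMA-capable service that HPC and AI middleware typically asks for. Link with -lfabric.

```c
/* Minimal sketch: discovering Libfabric (OFI) providers with fi_getinfo(). */
#include <stdio.h>
#include <rdma/fabric.h>

int main(void)
{
    struct fi_info *hints = fi_allocinfo();
    struct fi_info *info = NULL;

    /* Request a reliable datagram endpoint with messaging and RMA
     * capabilities, the kind of service AI/HPC middleware relies on. */
    hints->ep_attr->type = FI_EP_RDM;
    hints->caps = FI_MSG | FI_RMA;

    if (fi_getinfo(FI_VERSION(1, 18), NULL, NULL, 0, hints, &info) == 0) {
        for (struct fi_info *cur = info; cur; cur = cur->next)
            printf("provider: %s, fabric: %s\n",
                   cur->fabric_attr->prov_name, cur->fabric_attr->name);
        fi_freeinfo(info);
    } else {
        fprintf(stderr, "no matching Libfabric provider found\n");
    }
    fi_freeinfo(hints);
    return 0;
}
```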

  • The UEC architecture optimizes AI and HPC workloads by modernizing RDMA operation over Ethernet. The UEC stack introduces multiple innovations at the transport layer that are related to multipathing, congestion management, and reliability. These innovations enable higher network utilization and lower tail latency, which are critical for reducing AI and HPC job completion times. The UEC architecture is compatible with existing Ethernet switches and has optional extensions. One such extension is Link Level Reliability (LLR), which offers a fast hardware-based failover if there are link performance issues.

  • The UEC stack simplifies networking software and improves its performance. Two important features of the UEC stack are as follows:

    • RDMA operations are tuned to better address workload expectations and minimize hardware complexity. Scheduled and enhanced Ethernet improves the performance of an Ethernet-based network and significantly reduces job completion time.

    • Ultra Ethernet Transport (UET) provides multiple transport services that enhance RDMA hardware. The UEC stack enhances the transport layer with semantic adjustments, congestion notification protocols, and better security features. It provides more flexible transport, eliminates the need for lossless networks, and allows features such as multipathing and out-of-order packet transmission, which are required for many-to-many AI workloads; the sketch below illustrates the out-of-order bookkeeping.
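
A hypothetical C sketch of the receiver-side bookkeeping that makes out-of-order delivery workable: a cumulative counter plus a bitmap of early arrivals, so only genuinely missing packets need retransmission. The structure and names are illustrative, not UET wire format.

```c
/* Hypothetical sketch: tracking out-of-order packet arrival with a
 * cumulative base and a 64-entry bitmap. Not any UEC specification. */
#include <stdio.h>
#include <stdint.h>

struct rx_state {
    uint64_t base;    /* lowest sequence number not yet received */
    uint64_t bitmap;  /* bit i set => packet base+1+i already arrived */
};

static void on_packet(struct rx_state *rx, uint64_t seq)
{
    if (seq == rx->base) {
        /* In-order arrival: advance past it and any buffered successors. */
        rx->base++;
        while (rx->bitmap & 1) {
            rx->bitmap >>= 1;
            rx->base++;
        }
        rx->bitmap >>= 1;
    } else if (seq > rx->base && seq - rx->base <= 64) {
        /* Out-of-order arrival within the window: remember it. */
        rx->bitmap |= 1ULL << (seq - rx->base - 1);
    }
}

int main(void)
{
    struct rx_state rx = { 0, 0 };
    const uint64_t arrivals[] = { 0, 1, 3, 4, 2 };  /* packet 2 arrives late */

    for (size_t i = 0; i < sizeof arrivals / sizeof arrivals[0]; i++) {
        on_packet(&rx, arrivals[i]);
        printf("after pkt %llu: cumulative ack %llu, bitmap 0x%llx\n",
               (unsigned long long)arrivals[i],
               (unsigned long long)rx.base,
               (unsigned long long)rx.bitmap);
    }
    return 0;
}
```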

  • Note: A slow network path within an AI/ML cluster stalls most of the GPUs until the slowest transmission completes. The resulting delay is known as tail latency.

  • The UEC technology stack, as described in the UEC 1.0 Overview white paper, enables the following key features:

    • Multipath packet spraying: Instead of using simple Equal-Cost Multipath (ECMP) hashing, packets are load-balanced across multiple paths. This approach avoids congestion and improves application performance by reducing tail latency (see the sketch after this list).

    • Flexible ordering: The Libfabric API allows applications to express workload-specific requirements for packet ordering.

    • Modern and easily configured congestion control mechanisms: Coordinated sender and receiver congestion control over multiple paths guides the packet forwarding, which provides the highest network performance for AI/ML operations.

    • End-to-end telemetry: Switch-based advanced telemetry helps control congestion by shortening control-plane signaling time, enabling faster reaction to the short congestion events that are more common in high-bandwidth networks. Instead of dropping packets during congestion, switches can truncate a packet and deliver its headers and congestion information to the receiver. This process enables more efficient and faster recovery techniques based on selective acknowledgments rather than the heavyweight Go-Back-N approach used by traditional RDMA technologies.

    • Multiple transport delivery services: Application requirements determine the selection of the optimal transport services.
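
To make the multipath packet-spraying idea concrete, the following hypothetical C sketch contrasts ECMP's per-flow hashing, which pins every packet of a flow to one path, with per-packet spraying, which spreads a single flow across all paths. The hash and function names are illustrative only; real switches and NICs implement far more sophisticated load balancing.

```c
/* Hypothetical sketch: ECMP flow hashing vs. per-packet spraying. */
#include <stdio.h>
#include <stdint.h>

#define NUM_PATHS 4

/* ECMP: every packet of a flow hashes to the same path, so one large
 * RDMA flow can saturate a single link while others stay idle. */
static unsigned ecmp_path(uint32_t src_ip, uint32_t dst_ip,
                          uint16_t src_port, uint16_t dst_port)
{
    uint32_t h = src_ip ^ dst_ip ^ ((uint32_t)src_port << 16 | dst_port);
    h ^= h >> 16;
    return h % NUM_PATHS;
}

/* Spraying: successive packets round-robin across all paths, spreading
 * one flow's load; the receiver must tolerate reordering. */
static unsigned sprayed_path(uint64_t packet_seq)
{
    return packet_seq % NUM_PATHS;
}

int main(void)
{
    /* One flow between two hosts on UDP port 4791 (the RoCEv2 port). */
    for (uint64_t seq = 0; seq < 8; seq++)
        printf("pkt %llu: ecmp=%u spray=%u\n",
               (unsigned long long)seq,
               ecmp_path(0x0a000001, 0x0a000002, 4791, 4791),
               sprayed_path(seq));
    return 0;
}
```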

  • The UEC architecture addresses the current limitations of Ethernet by offering a high-performance, distributed, and lossless transport layer that is optimized for running HPC and AI at scale. Furthermore, the RDMA method of transferring data in large traffic blocks can cause network imbalances and excessive loads, so a new transport protocol that integrates RDMA for emerging applications is needed. The following figure explains the benefits of the UEC transport protocol.
