AI Transport

Optical and Copper Cabling

  • Optical and copper cabling form the backbone of data center networking. As data rates continue to increase and the demand for more efficient, high-speed data transmission grows, the correct optic and cable choices are essential for ensuring optimal performance, reliability, and scalability of data center networks.

Optical Transceivers

Optical transceivers convert electrical signals into optical signals and vice versa, enabling data transmission over fiber-optic cables. They consist of two main parts: a transmitter and a receiver. They are essential for high-speed communication in data centers, particularly for inter-rack and inter-data center connections.

  • Optical transceivers come in various types, differentiated by form factor, distance, and wavelength type.

Form-Factor Types

  • Early transceiver form factors supported low speeds and are no longer in use. Gigabit Interface Converters (GBICs), released in 1995, supported speeds up to 2.5 Gbps; Small Form-Factor Pluggable (SFP) transceivers eventually replaced GBICs.

  • Initially designed to support Ethernet, Fibre Channel, and carrier optical networking applications, SFP optical transceivers have been superseded by improved versions (SFP Plus [SFP+] and Quad SFP [QSFP]) that run at faster data rates. Some implementations also use a newer form factor, Octal SFP (OSFP), instead of QSFP. OSFP and QSFP-DD800 transceivers support speeds up to 800 Gbps.

  • The following table lists the most common form factors.

| Form Factor | Speed |
| --- | --- |
| SFP | 1 Gbps |
| SFP+ | 10 Gbps |
| QSFP and QSFP28/56/112/DD/112-DD/DD800 | Up to 800 Gbps |
| OSFP | 800 Gbps |

Optical Connector Types

  • Fiber-optic cables have various types of connectors, each designed for a different purpose. The fiber connector types, sometimes called terminations, connect both sides of a fiber-optic cable to terminals, switches, adapters, and patch panels.

  • The following table describes the most relevant connectors.

| Connector Type | Description |
| --- | --- |
| Lucent Connector (LC) | The LC connector is a small form-factor solution for fiber-optic connections. It is the most commonly used fiber-optic connector. |
| Corning/Senko (CS) | The CS connector is a Very Small Form Factor (VSFF) connector. Because of its small form factor, it is well suited for high-performance and AI/ML data centers. |
| Standard Connector (SC) | The SC connector is an older connector that is being replaced by LC connectors. |
| Straight Tip (ST) | The ST connector is one of the first fiber connector types adopted in fiber-optic networks worldwide. |
| Ferrule Connector (FC) | The FC connector has a ceramic ferrule with a stainless-steel screw mechanism for attachment, in contrast to the plastic bodies of most other fiber connector types, such as SC and LC connectors. |
| Multiple-Fiber Push-On (MPO) | The MPO connector combines up to 24 glass fibers in a rectangular ferrule. Because of its high density, it is commonly used in high-performance and AI/ML data centers. |

Organizing Data Center Cabling

  • Developing a physical network design involves following recommendations and guidelines to ensure a smooth and maintainable cable deployment. Cabling can consist of copper cables and optical fibers. In addition to the cables themselves, you can use cable-management elements to keep cabling organized and flexible, which facilitates updating, troubleshooting, and modifying the cabling in the future.

Cable Technologies

  • Ethernet cables use two media types: copper and optical fiber.

  • Ethernet LANs are mostly built using unshielded twisted-pair (UTP) copper cables. A UTP cable comprises eight copper wires grouped into four twisted pairs. Each pair has a color scheme: one wire is solid colored, and the other is the same color but striped. Color-coded labels are typically provided for the T568A and T568B wiring configurations to ensure correct wire termination (a simple reference sketch of both pinouts follows this list).

  • UTP cables support speeds up to 10 Gbps. To achieve higher speeds, optical fiber must be used.

  • Using optical fiber for the data center fabric is critical in high-performance data centers, high-performance computing (HPC), and AI/ML data centers to achieve high throughput. It is also required where the length and throughput requirements between nodes cannot be met with copper cable, for example, when building a high-throughput network backbone.

  • A typical Ethernet fiber link needs two fiber cables, one for each direction, and two transceiver modules such as SFP, SFP+, QSFP, or OSFP. Note that the transmit side on one device connects through one cable to the receive side on the other device, and the second cable makes the reverse connection.
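
  • The T568A and T568B pinouts mentioned above are fixed by the TIA-568 standard. The following minimal Python sketch (an illustration added here, not part of the course material) records both pinouts and classifies a patch cable as straight-through or crossover depending on which standard is used at each end.

```python
# Reference sketch of the T568A/T568B UTP pinouts (pin number -> wire color).

T568A = {
    1: "white/green", 2: "green",
    3: "white/orange", 4: "blue",
    5: "white/blue", 6: "orange",
    7: "white/brown", 8: "brown",
}

T568B = {
    1: "white/orange", 2: "orange",
    3: "white/green", 4: "blue",
    5: "white/blue", 6: "green",
    7: "white/brown", 8: "brown",
}

def cable_type(end_a: dict, end_b: dict) -> str:
    """Classify a patch cable by comparing the pinout used at each end."""
    return "straight-through" if end_a == end_b else "crossover"

print(cable_type(T568B, T568B))  # straight-through (same standard at both ends)
print(cable_type(T568A, T568B))  # crossover (different standards at each end)
```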

Patch Panel

  • A patch panel is a frame with a series of optical or RJ-45 ports into which cables (patch cords) are inserted. Fiber-optic and Ethernet patch panels connect the respective cable types (fiber optics and Ethernet).

  • Fiber-optic patch panels are used to manage and organize fiber-optic cables and provide a way to quickly and easily connect equipment. A fiber-optic patch panel is used to terminate the connection in the rack, where the connection originates from a different rack or location.

  • The panels facilitate breakout connectivity regardless of the data rate. The Cisco solution of panel and cable assemblies is appropriate for any breakout from 4x10 Gbps to 400 Gbps native. The panels are compatible with top-of-rack (ToR), middle-of-rack (MoR), and end-of-row (EoR) layouts and are specifically designed for high-density data centers that run AI and ML workloads.

  • This type of optical patch panel is specifically designed to connect fiber optics terminated with LC or CS connectors and to support the MPO connectors usually used at 400-Gbps speeds. One key feature of the panel is the ability to enable cable breakout, as explained in the following figure: each 400-Gbps link, for example, can break out into four separate 100-Gbps links.

  • There are two basic types of Ethernet patch panels.

    • Punch-down patch panels feature connectors where cables are punched down onto metal pins to establish connections. They provide reliable and secure connections that are ideal for permanent installations.

    • Feed-through patch panels, also called pass-through or coupler patch panels, feature ports where cables are inserted directly without the need for punching down. They offer quick and easy installation, allowing faster deployment and changes.

Recommendations for Cabling

  • Proper planning and cable management are crucial due to the volume of cables in large configurations, the size of SFP and QSFP connectors, and the bend radius of different optical cables.

  • The key recommendations for cabling are as follows:

    • Cables that are destined for the same switch at the same rack position must be split in a 50/50 manner (half the cables on the left side of the rack and half on the right side).

    • A good cable layout enables greater serviceability of all equipment.

    • If cables are run out the bottom and top of the rack, run them in quarters.

  • The key recommendations for cable management are as follows:

    • Bundle cables together in groups by relevance to ease management and troubleshooting.

    • Use cables of the correct length. Leave only a little slack at each end.

    • Keep copper and fiber runs separated.

    • Install spare cables in advance for future replacement of damaged cables.

    • Use color coding of the cable ties.

    • Place labels at both ends and along the run.

Ethernet Cables

  • Ethernet in data centers consists of a family of multilane link types with various signaling rates per lane. Ethernet uses copper or fiber-optic cables. Single-lane link support gives Ethernet a wider range of speeds than InfiniBand.

| Name | Lanes | Signaling Rate per Lane (Gbps) | Effective Bandwidth (Gbps) | Connector | Deployed |
| --- | --- | --- | --- | --- | --- |
| 1 Gigabit Ethernet | 1 | 1 | 1 | SFP and RJ-45 | — |

Latency, Connectors and Breakout Cables

  • Latency: In addition to the speed-of-light delay in the cable, which is approximately 5 nanoseconds per meter, copper Ethernet links may require Forward Error Correction (FEC), which introduces additional link delay of up to 120 nanoseconds because it adds redundant error-checking codes to every data frame (a rough latency estimate is sketched after this list).

    • FEC is a method of error control in data transmission in which the source (transmitter) sends redundant data, and the destination (receiver) recognizes only the portion of the data that contains no apparent errors. When FEC is used, the receiver can detect and correct a limited number of errors. If there are too many errors, the sender must retransmit the packets that contain the errors.

  • Connectors: In addition to the QSFP connectors that Ethernet and InfiniBand have in common, Ethernet uses SFP connectors for 10 GbE (SFP+), 25 GbE (SFP28), and 50 GbE (SFP56) cables.
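
  • As a rough illustration of the numbers above, the following Python sketch (an assumption-laden estimate, not a vendor formula) adds the propagation delay of roughly 5 ns per meter to an optional FEC penalty of up to 120 ns.

```python
def link_latency_ns(length_m: float, fec_enabled: bool = True,
                    fec_penalty_ns: float = 120.0) -> float:
    """Estimate one-way cable latency in nanoseconds.

    Propagation delay is taken as ~5 ns per meter of cable; the FEC penalty
    (up to ~120 ns) is added when forward error correction is enabled.
    """
    propagation_ns = 5.0 * length_m
    return propagation_ns + (fec_penalty_ns if fec_enabled else 0.0)

# Example: a 30 m copper run with FEC versus the same run without it.
print(link_latency_ns(30, fec_enabled=True))   # 270.0 ns
print(link_latency_ns(30, fec_enabled=False))  # 150.0 ns
```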

InfiniBand Cables

  • Cables are a key element in InfiniBand performance and scalability. To achieve scalable performance, InfiniBand implements a multilane cable architecture that stripes a serial data stream across N parallel physical links running at the same signaling rate. The following figure shows three link widths, referred to as 1X, 4X, and 12X (1, 4, and 12 parallel lanes).

  • InfiniBand uses copper or optical fiber cables. In copper cables, each lane (link) uses four conductors (two differential pairs: one pair for transmitting and one for receiving). InfiniBand performance has increased significantly from the original Single Data Rate (SDR) specification to the current eXtended Data Rate (XDR). The following table summarizes the InfiniBand generations.

| Name | Signaling Rate per Lane (Gbps) | Effective Bandwidth for 4X Link (Gbps) | Connector | Year |
| --- | --- | --- | --- | --- |
| SDR | 2.5 | 8 | CX4 | 2003 |
| DDR | 5 | 16 | CX4 | 2006 |
| QDR | 10 | 32 | QSFP | 2008 |
| FDR10 | 10 | 39 | QSFP | 2013 |
| FDR | 14 | 54 | QSFP | 2013 |
| EDR | 25 | 97 | QSFP28 | 2015 |
| HDR | 50 | 200 | QSFP56 | 2019 |
| NDR | 100 | 400 | OSFP | 2021 |
| XDR | 200 | 800 | OSFP | 2023 |
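
  • The effective-bandwidth column can be approximated from the per-lane signaling rate, the lane count, and the line encoding (8b/10b for SDR through QDR, 64b/66b from FDR10 onward). The Python sketch below illustrates that arithmetic; it is not an official formula, and later generations (HDR and beyond) use PAM4 signaling and FEC, so their published figures are effectively lane rate × lane count.

```python
def effective_bandwidth_gbps(rate_per_lane_gbps: float, lanes: int = 4,
                             encoding: str = "64b/66b") -> float:
    """Approximate InfiniBand link bandwidth after line-encoding overhead."""
    efficiency = {"8b/10b": 8 / 10, "64b/66b": 64 / 66}[encoding]
    return rate_per_lane_gbps * lanes * efficiency

# 4X examples matching the table above (values are approximate).
print(round(effective_bandwidth_gbps(2.5, encoding="8b/10b")))   # SDR -> 8
print(round(effective_bandwidth_gbps(10, encoding="8b/10b")))    # QDR -> 32
print(round(effective_bandwidth_gbps(25)))                       # EDR -> 97
```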

Latency, Connectors and Breakout Cables

  • Latency: Compared to Ethernet cables, there are differences due to in-connector electronics and the physical medium (copper compared to glass), but the latency due to cable length is roughly 5 ns per meter for both InfiniBand and Ethernet cables.

  • Connectors: InfiniBand uses different connectors depending on the InfiniBand generation; QSFP and OSFP connectors are in use today. The OSFP pluggable optical transceiver is designed with improved signal integrity and thermal performance. It eases the upgrade from 400 Gbps to the next generation of 800-Gbps optics (eight lanes of 100 Gbps) for data centers and other mid-range reach applications.

  • Breakout cables: A physical switch port can be split into multiple ports. This is similar to Ethernet port splitting, for example, converting a four-lane 100 GbE port into four 25 GbE connections. Breakout cables establish connectivity between switches and between switches and endpoints (adapters and data processing units [DPUs]). Multiple combinations of breakout cables are available in optical and copper variants.
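
  • To make the port-splitting idea concrete, the following Python sketch derives the child interfaces created when a four-lane port is broken out. The port names and speed map are illustrative assumptions only, not any vendor's CLI or API.

```python
# Illustrative sketch of port breakout: a four-lane parent port is split into
# four single-lane child ports. Names and speeds are assumptions for the example.

BREAKOUT_MAP = {
    "100G": ("25G", 4),   # four-lane 100 GbE -> 4 x 25 GbE
    "400G": ("100G", 4),  # four-lane 400G -> 4 x 100G
}

def breakout(parent_port: str, parent_speed: str) -> list[tuple[str, str]]:
    """Return (child_port, child_speed) pairs for a broken-out parent port."""
    child_speed, count = BREAKOUT_MAP[parent_speed]
    return [(f"{parent_port}/{lane}", child_speed) for lane in range(1, count + 1)]

print(breakout("Ethernet1/1", "100G"))
# [('Ethernet1/1/1', '25G'), ('Ethernet1/1/2', '25G'),
#  ('Ethernet1/1/3', '25G'), ('Ethernet1/1/4', '25G')]
```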

Ethernet Connectivity

  • An AI network supports large and complex workloads running over individual compute and storage nodes that work together as a logical cluster. AI networking connects these large workloads via a high-capacity interconnect fabric, which includes advanced interconnect technologies based on optics and cables. This network is referred to as the back-end network.

  • The back-end network initially relied on proprietary technologies, such as InfiniBand and Fibre Channel. When the industry introduced Remote Direct Memory Access (RDMA) over Converged Ethernet (RoCE), workloads began converging on a single Ethernet-based network.

  • Ethernet is a family of computer networking technologies that are commonly used in LANs, metropolitan-area networks (MANs), and WANs. It is considered the most widely used network technology. Speeds can vary from 10 Mbps up to 800 Gbps. It is highly scalable and flexible, making it suitable for various applications, including RoCE.

  • The following figure shows a simplified network architecture of such a data center:

    • The front-end network is used for connectivity to the internet and clients.

    • The back-end network is used for fabric connectivity: spine (fabric) switches connect with ToR switches (leaf switches), which enables connectivity between the graphics processing unit (GPU), storage, and compute nodes.

  • The back-end network requires high-speed 100-, 200-, 400-, or 800-Gbps optics to connect servers and network switches. These high-bandwidth connections are essential for handling the data generated by AI workloads.
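
  • As a toy illustration of the back-end fabric just described (the node names and counts are assumptions for the example, not a reference design), the following Python sketch wires every leaf (ToR) switch to every spine switch and attaches GPU, compute, and storage nodes to the leaves.

```python
from itertools import product

# Toy leaf-spine back-end fabric; all names and counts are illustrative.
spines = ["spine1", "spine2"]
leaves = ["leaf1", "leaf2", "leaf3", "leaf4"]
hosts = {"leaf1": ["gpu1", "gpu2"], "leaf2": ["gpu3", "gpu4"],
         "leaf3": ["compute1"], "leaf4": ["storage1"]}

links = []
# Every leaf connects to every spine (the fabric layer of the back-end network).
links += list(product(leaves, spines))
# Each GPU, compute, and storage node connects to its ToR (leaf) switch.
links += [(host, leaf) for leaf, attached in hosts.items() for host in attached]

for a, b in links:
    print(f"{a} <-> {b}")
```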

RDMA over Converged Ethernet

  • RoCE is a network protocol that facilitates data transfer between remote memory on multiple GPU nodes over the network.

  • The following figure shows the primary benefit of using RoCE: it enables direct memory-to-memory communication.

  • The protocol can be used on non-converged (traditional) and converged Ethernet networks. This effectively layers RDMA protocols and messaging services on top of an Ethernet transport, which allows you to have the full benefits of RDMA architecture in data centers where Ethernet is already present.

Network Load Balancing in AI/ML

  • AI/ML clusters are designed to run many simultaneous and independent jobs over the same network.

  • A job is split across multiple GPUs to distribute the load, and the cluster performs calculations on the dataset in parallel. Each GPU performs a smaller portion of the calculation and sends the results to all its peers in a transmission process known as the All-to-All collective (a minimal sketch of this communication pattern follows this list).

  • As more jobs execute independently, the job-to-job interference increases. As network congestion increases, tail latency increases. The synchronous nature of AI/ML algorithms magnifies the effects of tail latency. Therefore, AI/ML workloads are mostly network-bound, which means they heavily depend on network bandwidth and performance.

  • Standard Ethernet performs quite well for single jobs. However, performance decreases when additional jobs are added to the network. Enhanced and Scheduled Ethernet are used to improve multijob performance.

  • Standard Ethernet and Enhanced Ethernet are based on fully open standards, are broadly available, and offer a good cost-to-bandwidth ratio. Scheduled Ethernet provides nonblocking performance but is vendor-specific.

  • Cisco Silicon One is helping the industry meet rapid large-scale buildouts of AI/ML that networks of the past could not handle.

  • Using Cisco Silicon One, you can configure the network to deploy standard Ethernet, Enhanced Ethernet, or Scheduled Ethernet to achieve best-in-class characteristics for each use case.
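
  • The following minimal Python sketch illustrates the All-to-All pattern referenced above using MPI (mpi4py) rather than a GPU collective library; it is an illustration of the communication pattern only and assumes an MPI installation and launch via mpirun.

```python
# Run with, for example: mpirun -np 4 python all_to_all_demo.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Each rank prepares one message per peer, then every rank exchanges with every
# other rank in a single All-to-All step -- the pattern used when GPUs share
# partial results with all of their peers.
send = [f"partial result from rank {rank} to rank {dst}" for dst in range(size)]
recv = comm.alltoall(send)

print(f"rank {rank} received: {recv}")
```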

InfiniBand Connectivity

  • AI workloads are computationally intensive. To speed up model training and processing of large datasets, a distributed computing approach is used. This approach involves distributing the workload across interconnected servers or nodes via a high-speed, low-latency network.

  • InfiniBand provides ultra-low latency and is a network of choice in scientific computing and AI applications. The InfiniBand interconnect solutions, servers, and storage that are integrated with accelerated compute nodes deliver optimum performance to meet high-speed, low-latency network requirements.

  • InfiniBand offers multiple link performance levels, reaching speeds as high as 800 Gbps. Each link speed also provides low-latency communication within the fabric, which enables higher throughput than other protocols.

InfiniBand Overview

  • InfiniBand is a communication link for data flow between processors and I/O devices, supporting up to 64,000 addressable devices. InfiniBand architecture is an industry-standard specification that defines a point-to-point switched I/O framework for interconnecting servers, communications infrastructure, storage devices, and embedded systems.

  • InfiniBand is ideal for connecting multiple data streams (clustering, communication, storage, and management). The smallest complete InfiniBand architecture unit is a subnet. Routers connect multiple subnets to form a large InfiniBand architecture network.

InfiniBand Layers

  • Physical Layer: Physical connections such as cables, connectors, and transceivers

  • Link Layer: Point-to-point link operations and switching within a local subnet. The subnet manager assigns a 16-bit local identifier (LID) to all devices within a subnet. All packets sent within a subnet use the LID for addressing (the LID address space is illustrated in the sketch after this list)

  • Network Layer: Responsible for packet routing

  • Transport Layer: Responsible for in-order packet delivery, partitioning, channel multiplexing, and transport services (reliable connection, reliable datagram, unreliable connection, unreliable datagram, raw datagram)

    • Part of the transport layer is also InfiniBand Architecture (IBA) Subnet Administration and Routing (SAR):

      • Subnet Administration: This involves managing the configuration and operation of the InfiniBand subnet, including tasks like assigning addresses, managing routing tables, and ensuring efficient data flow

      • Routing: InfiniBand uses a sophisticated routing mechanism to ensure data packets are delivered efficiently across the network. This includes both unicast and multicast routing
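
  • As a small aside on the 16-bit LID mentioned under the link layer, the Python sketch below shows why the addressable-device count per subnet is on the order of 64,000. The unicast/multicast range split shown in the comments is an assumption based on common descriptions of the InfiniBand address space, not quoted from the course.

```python
# A LID is a 16-bit field, so the raw address space holds 2**16 values.
LID_BITS = 16
total = 2 ** LID_BITS
print(total)  # 65536 -> consistent with "up to 64,000 addressable devices"

# Assumed split (hedged): 0x0000 is reserved, 0x0001-0xBFFF are unicast LIDs,
# 0xC000-0xFFFE are multicast LIDs, and 0xFFFF is the permissive LID.
unicast = 0xBFFF - 0x0001 + 1
print(unicast)  # 49151 unicast LIDs under this assumed split
```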

  • As illustrated in the following figure, InfiniBand systems consist of the following:

    • Channel Adapter (CA): The CA is divided into a Host Channel Adapter (HCA) and a Target Channel Adapter (TCA):

      • An HCA is the point through which an IB end node, such as a server or storage device, connects to an IB network.

      • A TCA is a special form of channel adapter that is mostly used in embedded environments such as storage devices.

    • Switches: The switches in an InfiniBand system are similar in principle to other standard network switches but must meet InfiniBand's high performance and low-cost requirements.

    • Routers: Routers forward packets from one subnet to another without consuming or generating packets.

    • Cables: Cables are copper and optical.

    • Connectors

  • Key features of InfiniBand

    • Speeds up to 800 Gbps

    • Ultra-Low Latency (Under 1 Microsecond Application to Application)

    • Reliable, Lossless, Self-Managing Fabric

    • Full CPU Offload and Kernel Bypass

    • Memory Exposed to Remote Nodes (RDMA)

    • Quality of Service

    • Up to 64k Addresses

  • InfiniBand natively supports RDMA, which is one reason it has historically been preferred for HPC. With the evolution of RoCE, Ethernet is becoming a preferred technology over InfiniBand due to its low cost, compatibility, ubiquity, and ease of use.

Hybrid Connectivity

  • In a data center, hybrid connectivity means using multiple network technologies side by side. Examples of these technologies include, but are not limited to, Fibre Channel, InfiniBand, and Ethernet. Especially in AI/ML data centers, InfiniBand, Fibre Channel, and Ethernet can coexist to optimize performance, scalability, and cost-efficiency for AI/ML workloads. Each technology has distinct strengths and is helpful in specific use cases.

  • The following figure shows an example of a data center network with two different technologies: Ethernet and InfiniBand.

  • In a data center with hybrid network connectivity, InfiniBand and Ethernet are deployed together, each optimized for different parts of the AI and HPC workloads.

  • InfiniBand is typically used to connect the compute nodes within an AI cluster, where large amounts of data are exchanged between GPUs and Tensor Processing Units (TPUs) during model training.

  • Ethernet is used for general connectivity and is often used to connect servers to storage systems, where reliability and cost-efficiency are critical. With link speeds up to 800 Gbps, Ethernet provides enough throughput for data storage and retrieval. Ethernet is also the primary network for general-purpose communication within the data center. It connects various parts of the infrastructure, such as management systems, less latency-sensitive compute nodes, and client connections to the AI services.

Considerations for Hybrid Deployment

  • Data centers can save costs by reserving InfiniBand for performance-sensitive tasks and using Ethernet for other tasks due to its lower cost and broader availability.

  • The combination of InfiniBand and Ethernet also allows data centers to independently scale different parts of their InfiniBand and Ethernet infrastructures based on the AI/ML workload requirements. This combination enables you to tailor network performance to the needs of specific applications. This approach decreases total cost of ownership (TCO) and increases return on investment (ROI).

  • Ethernet might become a preferred option over InfiniBand due to the evolution of Scheduled and Enhanced Ethernet, so it is important to scale the InfiniBand deployments carefully to avoid unnecessary costs.

  • To successfully connect all compute nodes, InfiniBand and Ethernet are often interconnected using special gateways or bridges.

  • These devices enable communication between InfiniBand-based compute clusters, Ethernet-based storage, and external networks. They handle protocol translation and ensure seamless data transfer.

  • IP can also be used on an InfiniBand infrastructure for certain use cases. InfiniBand uses its own addressing mechanism rather than IP and does not support sockets, so it does not support legacy applications by default. To solve this problem, the IETF IPoIB working group specified IP over InfiniBand (IPoIB), a protocol that allows IP traffic to be carried over an InfiniBand network. This capability is important for environments where InfiniBand is used as the physical transport, but applications and services still rely on standard IP networking protocols.
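
  • Because IPoIB presents a normal IP interface to the operating system, an unmodified socket application can run over it. The short Python sketch below assumes a peer reachable at 10.10.10.1:9000 on an already-configured IPoIB subnet; both the address and the port are illustrative assumptions.

```python
import socket

# Plain TCP client code; nothing InfiniBand-specific is needed because IPoIB
# exposes the fabric as an ordinary IP interface to the host's network stack.
IPOIB_PEER = ("10.10.10.1", 9000)  # assumed address/port on the IPoIB subnet

with socket.create_connection(IPOIB_PEER, timeout=5) as sock:
    sock.sendall(b"hello over IPoIB\n")
    reply = sock.recv(1024)
    print(reply.decode(errors="replace"))
```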
