Network Challenges and Requirements for AI Workloads
Bandwidth and Latency
There are two reasons for distributed processing:
To speed up training (each host processes a different batch of data with the same ML model).
To handle many simultaneous inferencing queries coming from users spread across various locations.
AI/ML models process large volumes of data in parallel across multiple GPUs in the cluster. The more GPUs you build into your cluster and the more data you can process, the faster and more accurate your training will be. So, clusters may include thousands of GPUs, where each server is equipped with multiple GPUs and network interface cards (NICs).
Note: In production AI/ML clusters, the GPU:NIC ratio in each host is typically 1:1 or 2:1.
Training jobs are highly distributed, which means that a high volume of data flows between these components:
Main memory and GPU memory via the Peripheral Component Interconnect Express (PCIe) bus.
GPUs in the same host using NVIDIA's high-bandwidth NVLink communication link.
GPUs across different hosts in the cluster using the dedicated Ethernet network.
The network should be designed to extend intraserver high-bandwidth communication without causing a bottleneck. Its role is to enable a high-performance fabric so that all the GPUs in the cluster can behave like one large GPU.
The back-end network is used only for GPU-to-GPU connectivity to transport Remote Direct Memory Access (RDMA) traffic in the form of RDMA over Converged Ethernet (RoCE). It must be lossless and nonblocking. The front-end network is used for all other communications, such as management, storage, and user application traffic.
Note: RoCE is a transport mechanism that uses Ethernet, instead of native InfiniBand, to transport RDMA traffic in a data center.

AI/ML Application Networking Requirements
Training is the most demanding stage in terms of hardware resources. It requires a high-throughput environment for both storage and networking because the training phase involves full-rate, long-running flows. In addition, the workflow phases (process, notify, and synchronize) make the traffic predictable and bursty. Techniques that enable predictable quality of service (QoS), especially congestion management, are needed to handle this traffic. Traffic should be optimally load-balanced across the entire fabric to take advantage of the available bandwidth. Any network bottleneck, however small (for example, one that slows down a single flow), affects the entire job because all flows must complete for the training process to continue. So, regarding single job completion time (JCT), the cluster is only as fast as its slowest component.
High latency or packet drops in server cluster communications can prolong or disrupt training jobs because traffic loss forces retransmissions, which slow down the entire training workflow for that job. For this reason, a nonblocking network is considered necessary for AI fabrics.
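To make the JCT point concrete, the following minimal Python sketch (with purely hypothetical flow completion times) shows why a single slow flow sets the completion time of a whole training step.

```python
# Minimal sketch: a training step cannot finish until every flow completes,
# so job completion time (JCT) is set by the slowest flow in the step.
# The flow times below are hypothetical values for illustration only.

flow_completion_ms = {
    "gpu0->gpu7": 11.8,
    "gpu1->gpu6": 12.1,
    "gpu2->gpu5": 12.0,
    "gpu3->gpu4": 47.5,   # one congested flow (for example, a hash collision on an uplink)
}

step_jct_ms = max(flow_completion_ms.values())   # the collective exchange must fully complete
print(f"Step JCT: {step_jct_ms} ms (slowest flow dominates)")

# Even though three of the four flows finished in ~12 ms, the step takes ~47.5 ms,
# so a single bottlenecked flow slows down the entire job.
```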

Inference frameworks apply trained models to new data to quickly predict outcomes or make decisions. Unlike training, which is commonly done on-premises, inference often runs in systems such as smartphones or self-driving cars. For this reason, inferencing models are typically deployed at the network edge, as close as possible to the end user (in a remote data center, at a mobile base station, and so on). Inference is more latency-sensitive than throughput-sensitive because it makes real-time decisions based on queries or user prompts using smaller datasets. To accommodate this situation, the network should be optimized for near real-time operations.

Because many queries are executed simultaneously, the cluster must always provide low latency without variation (low jitter). Jitter not only disrupts data loading timing, which reduces training efficiency, but also affects the responsiveness of AI/ML models during inferencing, making the application less reliable.
Scalability Considerations
Due to their constantly increasing capabilities, AI/ML models double in size every couple of months. In addition, high accuracy is always required, which means that more quality data and more processing iterations are needed during training for a specific model. This growth increases server-to-server communication over the network. To meet these requirements, a form of spine-and-leaf architecture, sometimes referred to as a Clos topology, is used.
The Clos network is a multistage switching network architecture that was originally designed by telecommunications engineer Charles Clos in 1952. In an enterprise data center environment, a two-tiered Clos network is typically used: leaf switches form the first tier and spine switches form the second tier. Each spine switch connects to all leaf switches, and each server in the data center is connected to multiple leaf switches. The following figure illustrates a two-tiered Clos network.
In a two-tiered Clos network, each server in the data center is just three hops away from another server, as illustrated by the orange path in the following figure.

The first hop is from the server to the directly connected leaf switch, the second hop is across the spine switches to the destination leaf switch, and the third hop is between the destination leaf switch and the destination server. This path means that irrespective of the number of devices in the data center, the number of hops between the servers or the end hosts is always three, ensuring consistent latency in the network.
A two-tiered Clos network architecture is highly scalable. If you need to expand the cluster (add more hosts) and you need additional ports to enable connectivity, add a leaf switch. Because each leaf switch needs to be connected to every spine switch, ensure you have enough downlink ports on spine switches. If you need to increase the bandwidth capacity of your fabric, instead of replacing existing leaf-spine cabling, which is impractical, you can add more spine switches and connect them to every leaf switch. The scaling of the leaf tier for connectivity and spine tier for bandwidth is illustrated in the following figure.

Link oversubscription is still possible in the example topology because suboptimal load-balancing algorithms (such as traditional Equal-Cost Multipath [ECMP]) can still lead to uneven link utilization and congestion.
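To illustrate how the leaf tier scales connectivity and the spine tier scales bandwidth, the following Python sketch estimates the size of a nonblocking two-tiered Clos fabric. The 64-port switch radix is an assumption chosen only for illustration, not a specific product.

```python
# Minimal sizing sketch for a nonblocking two-tiered Clos fabric.
# The port counts are illustrative assumptions (same link speed assumed everywhere).

leaf_ports = 64        # total ports per leaf switch
spine_ports = 64       # total ports per spine switch

# Nonblocking requires leaf downlink bandwidth <= leaf uplink bandwidth,
# so at most half of the leaf ports face hosts.
downlinks_per_leaf = leaf_ports // 2    # host-facing ports
uplinks_per_leaf = leaf_ports // 2      # one uplink to each spine

max_spines = uplinks_per_leaf           # each leaf needs one link per spine
max_leaves = spine_ports                # each spine needs one link per leaf
max_hosts = max_leaves * downlinks_per_leaf

print(f"Spines: {max_spines}, leaves: {max_leaves}, host ports: {max_hosts}")
# With 64-port switches this yields 64 leaves x 32 downlinks = 2048 host ports
# while keeping a 1:1 (nonblocking) oversubscription ratio.
```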
Note: The following summarizes how the fabric scales:
Adding a leaf switch: provides additional ports for hosts.
Adding a spine switch: increases fabric bandwidth.
Adding a NIC: increases host bandwidth.
Adding a host: increases the cluster compute capacity.
Redundancy and Resiliency
Failures impact your network, but you can increase the resiliency of your network design.
AI/ML networks must stay resilient despite the rapid increase in the number of devices. Network resiliency is achieved by increasing redundancy and enabling adequate failure detection methods. The larger the network, the higher the probability of a device failure; however, the impact of an individual failure is smaller. A failure usually leads to reduced fabric bandwidth. For example, in a four-spine fabric, a single spine switch failing results in a 25 percent reduction of fabric bandwidth capacity. In an eight-spine fabric, it would be only 12.5 percent, and so on.
If a leaf switch fails, the locally connected hosts lose their links to that switch. However, because each host is dual-homed to two leaf switches, the host is not disconnected from the fabric. The bandwidth available to the impacted hosts is halved until full connectivity is restored.
In both cases, the traffic flows that were in transit during the failure will be affected. The packets that are lost on the failed switch will need to be retransmitted. There will be some impact on the AI/ML job performance, but usually, the affected jobs will not fail.
The level of impact depends on the following:
Failure detection method: Preferably, hardware-based failure detection is supported, which allows for a quicker reaction to failure and minimizes packet loss.
Retransmission method: Selective retransmission, available in modern Scheduled Ethernet fabrics, is significantly more efficient than the traditional go-back-N approach, as the sketch after the following note illustrates.
Note: Hardware-based link failure detection also avoids traffic "blackholing" (sending the traffic on the failed link).
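As a rough illustration of the difference in retransmission methods, the following Python sketch (with an assumed window size and loss position) compares how many packets must be resent after a single loss under go-back-N versus selective retransmission.

```python
# Minimal sketch comparing retransmission cost after a single loss.
# Window size and loss position are illustrative assumptions.

window = list(range(100, 164))   # 64 in-flight packet sequence numbers
lost = {120}                     # one packet dropped on a failed link

# Go-back-N: everything from the first lost packet onward is retransmitted.
go_back_n = [seq for seq in window if seq >= min(lost)]

# Selective retransmission: only the missing packets are retransmitted.
selective = [seq for seq in window if seq in lost]

print(f"Go-back-N resends {len(go_back_n)} packets")    # 44 packets
print(f"Selective resends {len(selective)} packet(s)")  # 1 packet
```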
Visibility
Visibility, troubleshooting, root-cause analysis, and remediation of network issues are common challenges for day-to-day network operations. AI applications require network visibility into hotspots and issues so they can be tuned as necessary. The monitoring, logging, and auditing capabilities of Cisco Nexus Dashboard Insights can help you address these challenges.
Cisco Nexus Dashboard Insights
Cisco Nexus Dashboard Insights, a modern networking operation service, simplifies and automates operation tasks such as monitoring and troubleshooting. It provides deep visibility across the entire infrastructure by ingesting real-time streamed network telemetry from all devices.
Cisco Nexus Dashboard Insights receives network telemetry data from the network devices. It obtains fine-grained visibility through the telemetry data, including the control and data plane operations and performance. It detects and highlights components that exceed capacity thresholds using fabric-wide visibility of resource utilization and historical trends. It also analyzes and learns about the network's baseline behavior and detects anomalies and their root causes. Also, Cisco Nexus Dashboard Insights provides assisted auditing and compliance checks using searchable historical data that are presented in the time-series format.
Monitoring
Cisco Nexus Dashboard Insights establishes a baseline for utilization of resources, monitors trends, and detects anomalies of abnormal resource usage across nodes to help you plan your capacity needs in the network. The Resource Utilization feature shows time-series-based capacity utilization trends by correlating software telemetry data that is collected from nodes in each site. Persistent trends help identify overloaded or malfunctioning devices. This information will guide you in planning for resizing, restructuring, and repurposing.
The following figure shows the Resource Utilization feature of Cisco Nexus Dashboard Insights, where you can observe Site Capacity by Utilization across different resources (such as Bridge Domains, Contracts, and so on) and Top Nodes by Utilization.
The Resource Utilization feature categorizes capacity utilization as follows:
Operational resources: The capacity of temporary resources that are dynamic and expected to change over short intervals. Examples are routes, MAC addresses, and security ternary content-addressable memory (TCAM).
Configuration resources: The capacity utilization of configuration-dependent resources, such as the number of virtual routing and forwarding (VRF) instances, bridge domains, VLANs, and endpoint groups (EPGs).
Hardware resources: The port and bandwidth-capacity utilization.
Flow Telemetry is a Cisco Nexus Dashboard Insights feature that uses flow records and respective counters. This data is correlated over time to provide an end-to-end flow path and latency. Flow Telemetry calculates the average latency of each flow. When the latency exceeds a specified threshold, it alerts users and shows the abnormal latency as an anomaly on the dashboard.
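The following Python sketch illustrates the general idea behind such latency-based anomalies: average the latency samples of each flow and flag flows that exceed a threshold. It is a conceptual illustration only; the threshold, flow tuples, and values are assumptions, not the Cisco Nexus Dashboard Insights implementation.

```python
# Conceptual sketch of latency-based flow anomaly detection.
from statistics import mean

THRESHOLD_US = 500  # hypothetical latency threshold in microseconds

# flow_records maps a flow 5-tuple to latency samples (microseconds) over a time window.
flow_records = {
    ("10.0.1.5", "10.0.2.9", 17, 50120, 4791): [180, 210, 195, 205],
    ("10.0.1.7", "10.0.2.3", 17, 50987, 4791): [430, 820, 910, 760],  # congested path
}

for flow, samples in flow_records.items():
    avg = mean(samples)
    if avg > THRESHOLD_US:
        print(f"Anomaly: flow {flow} average latency {avg:.0f} us exceeds {THRESHOLD_US} us")
```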
The flow analytics dashboard shows key indicators of infrastructure data plane health. Time-series data offer evidence of historical trends, specific patterns, and past issues and help the operator build a case for audit, compliance, capacity planning, or infrastructure assessment.
The flow analytics dashboard provides a time-series-based overview, including the following metrics:
Top Nodes by Average Latency: This metric shows the top nodes by the highest average end-to-end latency.
Top Flows by Average Latency: This metric shows time-series-based latency statistics. Clicking a particular flow shows detailed flow data, including the exact path of the flow in the fabric and the end-to-end latency. By providing information about the root cause of the increased latency, this process removes the trial-and-error and manual steps that are otherwise needed to pinpoint latency hotspots in the infrastructure.
Top Flows by Packet Drop Indicator: This metric shows time-series-based packet drop statistics. Clicking a flow shows detailed flow data, including the exact point in the fabric where the drop occurred and why it occurred. This approach saves time when troubleshooting and helps operators quickly identify and locate the specific potential problem points in the infrastructure.
Analytics
Event Analytics Dashboard is a Cisco Nexus Dashboard Insights feature that monitors control-plane events in the infrastructure. It performs these tasks:
Data collection: It monitors configuration changes and control plane events and faults.
Analytics: AI/ML algorithms determine the correlations between all changes, events, and faults.
Anomaly detection: It provides the output of AI/ML algorithms (unexpected or downtime-causing events).
The Event Analytics Dashboard displays faults, events, and audit logs as a sequence of data points arranged chronologically. You can get detailed information on the historical state of any of these data points. Faults, events, and audit logs are correlated to identify, for example, whether a configuration deletion led to a fault.
The Event Analytics Dashboard displays the following metrics:
Audit logs: This metric shows the creation, deletion, and modifications of all device configuration objects. It is useful for identifying recent changes that may be a potential reason for unexpected behavior. It can also help revert changes to a stable state and assign accountability.
Events: This metric shows operational events in the infrastructure, such as IP detach/attach, port attach/detach on a virtual switch, and interface state changes.
Faults: This metric shows issues in the infrastructure (for example, invalid configurations). This function speeds up problem rectification and reduces the time needed for root-cause analysis and remediation.
In addition to summarizing the anomalies and alerts, Cisco Nexus Dashboard Insights allows you to interactively browse, search, and analyze the anomalies and advisory alerts that are generated by the service.
Anomalies are detected network issues, such as the following:
Issues with resource utilization.
Environmental issues such as power failures, memory leaks, process crashes, node reloads, and CPU and memory spikes.
Interface and routing protocol issues, such as cyclic redundancy check (CRC) errors, Digital Optical Monitoring (DOM) anomalies, interface drops, Border Gateway Protocol (BGP) neighbor issues, Protocol Independent Multicast (PIM) and Internet Group Management Protocol (IGMP) flaps, Link Layer Discovery Protocol (LLDP) flaps, and Cisco Discovery Protocol issues. It also provides a view into microbursts with offending and victim flows.
Flow drops, including their location and reason, and abnormal flow latency spikes, detected using hardware telemetry and direct hardware export.
Endpoint duplicates, rapid endpoint movement, and rogue endpoints.
Issues in the network configuration are detected and reported as change analysis anomalies.
Violations of the compliance requirements for compliance assurance are detected and reported as compliance anomalies.
Issues that are found in the network forwarding analysis and assurance are detected and reported as forwarding anomalies.
Application issues detected by AppDynamics and Cisco Nexus Dashboard Insights (AppDynamics integration required).
Nonblocking Lossless Fabric
Network administrators need to be aware of the characteristics of high-performance AI/ML networks, different load-balancing methods, and how they impact AI/ML application performance.
AI applications, especially the training stages, take advantage of and require high and consistent networking performance and lossless transport. To achieve this goal, network administrators need to deploy the correct hardware and software features, and a configuration optimized for AI application needs.
In terms of network topology, aside from good scaling properties, the Clos architecture (illustrated in the following figure) is optimal for AI/ML workloads because of the following features:
Nonblocking fabric: This fabric allows the host to have full-rate connectivity to any other host in the fabric.
Deterministic performance: This fabric allows low and consistent latency between any two hosts, regardless of the network load and traffic patterns.
However, an optimal topology alone cannot guarantee nonblocking transport. Equally important is the method of steering traffic through the fabric. Consider the following example, where two scenarios use the same topology with traditional Ethernet ECMP load balancing but exhibit completely different behavior. In the first scenario, ECMP distributes the two flows (color-coded) optimally from a leaf switch across both available uplinks toward the spine tier. However, because there is no control over the hashing function, and the hashing function does not consider link load levels, you could quickly end up in a situation like the one shown on the right: both flows are steered toward the first uplink only, reducing the host bandwidth by half and potentially causing congestion.
Note: The ECMP algorithm calculates the output link for a packet by hashing using the source and destination IP addresses and transport protocol port numbers as input values.
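The following Python sketch models this behavior: a stateless hash of the 5-tuple picks the uplink, with no awareness of link load, so two flows can land on the same uplink. The hash function and addresses are illustrative assumptions.

```python
# Minimal sketch of traditional ECMP: the uplink is chosen by hashing the 5-tuple,
# with no awareness of per-link load, so two flows can collide on the same uplink.

import zlib

UPLINKS = ["uplink-1", "uplink-2"]

def ecmp_pick(src_ip, dst_ip, proto, sport, dport, uplinks=UPLINKS):
    """Pick an uplink from a stateless hash of the 5-tuple (illustrative hash only)."""
    key = f"{src_ip}|{dst_ip}|{proto}|{sport}|{dport}".encode()
    return uplinks[zlib.crc32(key) % len(uplinks)]

# Two RoCEv2 flows between hosts in the same subnet; only the source port differs.
flow_a = ecmp_pick("10.0.0.11", "10.0.0.21", 17, 49152, 4791)
flow_b = ecmp_pick("10.0.0.12", "10.0.0.22", 17, 49153, 4791)
print(flow_a, flow_b)   # if both hash to the same uplink, that link carries both flows
```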

Advanced Ethernet fabrics, such as Ultra Ethernet (as defined by the Ultra Ethernet Consortium) and Scheduled Ethernet, support optimal, deterministic, fabric-controlled spray-and-reorder load-balancing methods. The ingress leaf switch sprays packets across the fabric links based on their load to take optimal advantage of all fabric links, and the egress leaf switch reorders the packets of each flow before delivering them to the destination hosts. As illustrated in the following figure, this approach enables deterministic fabric behavior and achieves high performance independently of traffic characteristics.
Note: Out-of-order packet delivery is not acceptable for RDMA (RoCE), which is the dominant traffic in AI/ML GPU-based clusters.
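The following Python sketch is a simplified conceptual model of spray-and-reorder (not the actual Silicon One implementation): the ingress side sends each packet on the currently least-loaded link and tags it with a sequence number, and the egress side restores per-flow order so that the destination host still receives the flow in order.

```python
# Conceptual sketch of the spray-and-reorder idea.

def spray(packets, link_load):
    """Ingress leaf: send each packet on the least-loaded link; tag it with a sequence number."""
    sent = []
    for seq, payload in enumerate(packets):
        link = min(link_load, key=link_load.get)   # fabric-load-aware link choice
        link_load[link] += len(payload)
        sent.append((link, seq, payload))
    return sent

def reorder(received):
    """Egress leaf: reassemble the flow in sequence-number order before delivery to the host."""
    return [payload for _, payload in sorted((seq, p) for _, seq, p in received)]

packets = [b"p0", b"p1", b"p2", b"p3"]
sent = spray(packets, {"link-1": 0, "link-2": 0, "link-3": 0, "link-4": 0})
assert reorder(sent) == packets   # the destination host still sees in-order delivery
```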

Silicon One ASIC: Supported Ethernet Modes
Ethernet: Traditional Ethernet with stateless ECMP load balancing, which is suboptimal.
Enhanced Ethernet: Achieves better fabric utilization by considering the load of the links when load balancing the traffic. The network maintains flow state, and flows can be rebalanced only when they are idle.
Ultra Ethernet: Achieves even better fabric utilization than Enhanced Ethernet by using endpoint-controlled adaptive packet spraying.
Scheduled Ethernet: Enables fabric-controlled optimal load balancing with the spray-and-reorder method. It achieves maximum performance by avoiding head-of-line (HoL) blocking and incast congestion.
Note: HoL blocking happens when a packet at the front of a queue prevents the packets behind it from moving forward, even if those packets could otherwise be processed. Incast congestion happens when multiple input ports send packets to the same output port, overflowing it.

Congestion Management
A typical single training job runs on hundreds of GPUs, while GPU clusters typically consist of thousands of GPUs, so these clusters are shared among multiple AI/ML tenants. Therefore, it is important to know about the traffic prioritization, load balancing, and congestion management techniques that are available to increase the performance of AI/ML applications.
Traffic Prioritization
Even though AI traffic is on a dedicated back-end network, proper QoS is highly recommended. Dedicating a queue distinguishes AI traffic from other traffic on the link and reserves scheduling resources. This reduces latency and eliminates potential contention with other traffic. In the following figure, queue 3 is reserved for AI traffic.

Congestion management is enabled on the AI traffic queue; that is, explicit congestion notification (ECN) and priority flow control (PFC) protocols operate on traffic in this queue. As a result, congestion notification packets (CNPs) are generated, and their handling is critical for AI application performance. Unlike RoCE traffic, which is high volume, CNP traffic is much lighter but requires minimal latency. To deliver congestion signaling in time, CNP traffic is placed in a strict priority queue.
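The following Python sketch models this queuing arrangement conceptually: CNPs sit in a strict-priority queue that is always served first, RoCE traffic has its own dedicated queue, and everything else is best effort. The queue names and numbers are illustrative assumptions, not a device configuration.

```python
# Minimal scheduler sketch: CNPs are always served first, the AI (RoCE) queue next,
# and best-effort traffic last.

from collections import deque

queues = {
    "strict-priority (CNP)": deque(),
    "queue 3 (RoCE/AI)": deque(),
    "queue 0 (best effort)": deque(),
}

def enqueue(queue_name, packet):
    queues[queue_name].append(packet)

def dequeue():
    """Serve queues in priority order: CNPs first, then the AI queue, then best effort."""
    for name in ("strict-priority (CNP)", "queue 3 (RoCE/AI)", "queue 0 (best effort)"):
        if queues[name]:
            return name, queues[name].popleft()
    return None

enqueue("queue 0 (best effort)", "web reply")
enqueue("queue 3 (RoCE/AI)", "RoCE segment")
enqueue("strict-priority (CNP)", "CNP")
print(dequeue())   # ('strict-priority (CNP)', 'CNP') -- congestion signaling is not delayed
```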
Load Balancing
A traditional ECMP algorithm load balances traffic based on source and destination IP addresses and transport protocol ports. However, AI cluster hosts usually belong to the same subnet, so the IP addresses provide little entropy. Also, the destination Layer 4 port for RoCEv2 traffic is always the same (4791), so the entropy comes only from the Layer 4 source port. This lack of tuple uniqueness leads to hash collisions that place multiple flows on the same link, causing traffic polarization on the uplinks and uneven fabric utilization. The small number of large flows and multihop collisions further aggravate the problem.
One solution would be to use higher-bandwidth links. However, this approach does not address the root cause and is not economical because much of the fabric bandwidth is unused.
The more feasible solution would be to use an enhanced ECMP algorithm, such as the following:
User-defined field (UDF): UDF-based load balancing uses an additional field from the packet headers. Because every RoCE connection is identified with a unique Destination Queue Pair field (in the InfiniBand header), using this field for hashing achieves load balancing on a flow level (see the sketch after the following note).
Dynamic load balancing (DLB): DLB load balances on a flow level based on path congestion. It is useful for heavy flow load balancing.
Note: Enhanced Ethernet uses a form of the DLB algorithm.
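The following Python sketch shows the UDF idea: adding the RoCE Destination Queue Pair to the hash input restores the entropy that the IP addresses and the fixed destination port 4791 do not provide. The hash function and values are illustrative assumptions.

```python
# Minimal sketch of UDF-based hashing: the RoCE Destination Queue Pair (carried in the
# InfiniBand transport header) is added to the hash input. Illustrative hash only.

import zlib

UPLINKS = ["uplink-1", "uplink-2", "uplink-3", "uplink-4"]

def udf_pick(src_ip, dst_ip, sport, dest_qp, uplinks=UPLINKS):
    key = f"{src_ip}|{dst_ip}|{sport}|4791|{dest_qp}".encode()
    return uplinks[zlib.crc32(key) % len(uplinks)]

# Same host pair and ports: with plain 5-tuple hashing these connections would collide,
# but each RoCE connection has a unique Destination QP, so they can spread across uplinks.
for qp in (0x11, 0x12, 0x13, 0x14):
    print(hex(qp), udf_pick("10.0.0.11", "10.0.0.21", 49152, qp))
```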
Enhanced ECMP algorithms perform better than traditional ECMP load balancing but are still suboptimal. Uneven link utilization is still possible because the balancing is done on the flow level without packet reordering, and AI flows are bandwidth-heavy. Optimal load balancing and ultimate Ethernet fabric performance can be achieved with Scheduled Ethernet, which is used to build Disaggregated Scheduled Fabrics (DSFs).
DSF enables packet reordering to achieve optimal traffic balancing across all the links, with all flows treated fairly. The ingress leaf switch reorders the packets of each flow, and the egress switch reassembles the flow in proper order. This Cisco Silicon One ASIC hardware-based approach completely avoids flow collisions because it does not use hashing.
Note: In literature, Scheduled Ethernet and DSF are sometimes used interchangeably. Scheduled Ethernet refers to how an Ethernet fabric operates, whereas DSF refers to the architecture for building scalable, nonblocking Ethernet fabrics.
Congestion Control
RDMA (RoCE) traffic expects a lossless fabric to maintain adequate AI/ML application performance, but congestion can still happen, and its impact must be managed to avoid excessive packet drops. Congestion control protocols perform this function and are implemented differently across Ethernet modes:
Ethernet: Traditional Ethernet uses regular ECN and PFC protocols to react to congestion events.
Enhanced Ethernet: This mode allows a faster and more granular reaction to congestion events by using network telemetry based on the congestion score metric in Cisco Nexus Dashboard Insights.
Ultra Ethernet (as defined by the Ultra Ethernet Consortium): This mode performs network-influenced congestion management based on link utilization.
Scheduled Ethernet: Congestion events are avoided by using the request-grant mechanism—each virtual output queue (VOQ) requests a grant before the packet is sent to the output port.
Note: VOQ is a coupled ingress-egress virtual output forwarding architecture for managing traffic inside a network device. A separate queue is maintained at the ingress port for each egress port. This approach avoids the common problem of head-of-line blocking, where output congestion for some packets in an input queue blocks the rest of the packets in the same input queue that are destined for other, uncongested outputs.
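The following Python sketch is a simplified model of VOQs with a request-grant step (not the actual scheduler implementation): the ingress keeps one queue per egress port and forwards a packet only after the egress grants it, so a congested egress port does not block traffic destined for other ports.

```python
# Conceptual sketch of virtual output queues (VOQs) with a request-grant step.

from collections import deque

voqs = {"egress-1": deque(), "egress-2": deque()}      # one VOQ per egress port
egress_credits = {"egress-1": 0, "egress-2": 2}        # egress-1 is congested (no credits)

def request_grant(egress):
    """Return True if the egress port can accept a packet right now."""
    if egress_credits[egress] > 0:
        egress_credits[egress] -= 1
        return True
    return False

voqs["egress-1"].extend(["a1", "a2"])
voqs["egress-2"].extend(["b1", "b2"])

for egress, queue in voqs.items():
    while queue and request_grant(egress):
        print(f"forwarding {queue.popleft()} to {egress}")
# Only b1 and b2 are forwarded; a1 and a2 wait in their own VOQ without
# blocking the packets destined for the uncongested egress-2 port.
```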
ECN is negotiated end-to-end between hosts in advance during connection establishment. If congestion builds up, the receiver instructs the sender to slow down by sending a congestion notification packet (CNP). In severe congestion where the sender does not slow down fast enough, PFC acts as a fail-safe for ECN. PFC sends pause frames from receiver to sender on a hop-by-hop basis.

Note: ECN and PFC are good congestion management mechanisms, but they do have some downsides, such as congestion propagation, PFC storms causing deadlocks, and reliance on timeouts.
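The following Python sketch models the ECN/CNP feedback loop conceptually: the switch marks packets instead of dropping them when its queue builds up, the receiver returns a CNP, and the sender reduces its rate. The threshold and rate-reduction factor are assumptions for illustration, not protocol-mandated values.

```python
# Minimal sketch of the ECN/CNP feedback loop.

ECN_MARK_THRESHOLD = 100      # queue depth (packets) at which the switch starts marking
RATE_REDUCTION_FACTOR = 0.5   # hypothetical multiplicative decrease applied per CNP

def switch_forward(packet, queue_depth):
    """Mark the packet's ECN bits instead of dropping it when the queue is congested."""
    if queue_depth > ECN_MARK_THRESHOLD:
        packet["ecn"] = "CE"          # Congestion Experienced
    return packet

def receiver_process(packet):
    """Receiver: if the packet arrived ECN-marked, send a CNP back to the sender."""
    return {"type": "CNP"} if packet.get("ecn") == "CE" else None

def sender_on_cnp(current_rate_gbps):
    """Sender: slow down when a CNP arrives. (PFC pause frames act as a hop-by-hop fail-safe.)"""
    return current_rate_gbps * RATE_REDUCTION_FACTOR

pkt = switch_forward({"seq": 1}, queue_depth=150)
if receiver_process(pkt):
    print("sender rate:", sender_on_cnp(100.0), "Gbps")   # 50.0 Gbps after one CNP
```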
Scheduled Ethernet congestion management is much simpler than manually configuring PFC and ECN on each node. Congestion avoidance is hardware-based, and only the host-leaf connection needs manual configuration.