Congestion Visibility
Explicit Congestion Notification
ECN and PFC play important roles in building a lossless and high-performance Ethernet network. In this topic, you will explore how the ECN mechanism operates.
The ECN feature is used between two ECN-enabled endpoints to signal potential congestion so that the sender reduces its transmission rate. An ECN-capable network node uses a congestion avoidance algorithm to monitor queue utilization. When utilization reaches a specified threshold, the node marks the traffic that is contributing to the congestion. In this example, weighted random early detection (WRED) detects congestion and marks traffic with ECN bits.
Note: Instead of WRED, the approximate fair dropping (AFD) technique can be used to enable flow-level granularity.
Cisco Nexus 9000 Series Switches can mark packets with ECN bits during network congestion. WRED on Cisco Nexus 9000 Series Switches operates at a per-queue level, with two thresholds set per queue. The WRED minimum threshold is set at a lower buffer utilization level and indicates a minor congestion event. When buffer utilization reaches the minimum threshold, WRED marks a portion of the outgoing packets leaving the queue. The portion depends on the drop probability value in the WRED configuration on the Cisco Nexus 9000 Series switch, which is expressed as a percentage of all outgoing packets.
For example, if the drop probability parameter is set to 10, 10 percent of all outgoing packets are marked. If this action does not relieve the congestion, queue buffer utilization continues to grow until it reaches the WRED maximum threshold. After crossing the WRED maximum threshold, the switch marks the ECN bits on every outgoing packet.
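The marking decision described above can be pictured in a few lines of Python. This is a minimal sketch of the per-queue logic, assuming the simplified fixed-percentage behavior described in this section; the thresholds and probability values are placeholders, not Cisco defaults.

```python
import random

def wred_ecn_mark(queue_depth, min_threshold, max_threshold, drop_probability_pct):
    """Decide whether one outgoing packet should be ECN-marked.

    Below the minimum threshold, nothing is marked; between the thresholds,
    roughly drop_probability_pct percent of packets are marked; above the
    maximum threshold, every packet is marked.
    """
    if queue_depth < min_threshold:
        return False
    if queue_depth >= max_threshold:
        return True
    return random.random() < drop_probability_pct / 100.0

# Illustrative values only: queue depth and thresholds in arbitrary buffer units.
print(wred_ecn_mark(queue_depth=200, min_threshold=150,
                    max_threshold=3000, drop_probability_pct=10))
```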

ECN is marked on the network node where congestion is experienced, using the two least significant bits of the type of service (ToS) field in the IP header. When a receiver gets a packet with the ECN bits set to 11 (congestion experienced), it generates and sends a congestion notification packet (CNP) back to the sender.
The ECN bit combinations have the following meanings:
00: Not ECN-capable transport
10: ECN-capable transport, ECT(0)
01: ECN-capable transport, ECT(1)
11: Congestion experienced
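Because the ECN codepoint occupies the two least significant bits of the ToS byte, a receiver can read it with a simple bit mask. The snippet below is an illustrative sketch, not part of any RoCEv2 stack.

```python
# ECN codepoints carried in the two least significant bits of the IP ToS byte.
ECN_CODEPOINTS = {
    0b00: "Not ECN-capable transport",
    0b10: "ECN-capable transport, ECT(0)",
    0b01: "ECN-capable transport, ECT(1)",
    0b11: "Congestion experienced (CE)",
}

def decode_ecn(tos_byte):
    """Return the meaning of the ECN codepoint in a ToS/Traffic Class byte."""
    return ECN_CODEPOINTS[tos_byte & 0b11]

print(decode_ecn(0b00000011))  # -> "Congestion experienced (CE)"
```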
This WRED ECN mechanism informs endpoints about congestion. Endpoints can use this information to slow down the rate of sending traffic. In the following example, you have a two-tier network. Host A and Host B are sending data to Host X.

Congestion occurs on Leaf X because there is more bandwidth on the spine-leaf link than on the leaf-host link, and the port that is connected to Host X is oversubscribed. Buffer utilization starts increasing on Leaf X. When it reaches the WRED minimum threshold, Leaf X starts marking the ECN bits of a portion of the packets with the value 11 to indicate congestion in the data path.

When the destination (Host X) receives the marked packet, it understands that congestion is happening in the data path and sends a CNP packet to the source host. The CNP packet informs the source host that congestion is happening so that it can reduce the traffic transmission rate.
Because only a portion of the packets were marked with the congestion-experienced bits, the source reduces the throughput for that flow but continues to send packets. If the congestion continues and buffer utilization rises above the WRED maximum threshold, the switch starts marking the ECN bits on every packet. The sender then receives many CNP packets and, based on its congestion control algorithm, drastically reduces its transmission rate toward the destination. This action mitigates the congestion, and buffer utilization should start decreasing. After some time, the sender gradually increases the traffic rate until congestion occurs again, and the whole process repeats.
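The sender-side reaction can be pictured with a highly simplified model. Real RoCEv2 NICs typically implement more elaborate rate and recovery calculations; the reduction and recovery factors below are illustrative assumptions only.

```python
class SenderRateControl:
    """Toy model of a sender that cuts its rate on CNPs and recovers afterward."""

    def __init__(self, line_rate_gbps):
        self.line_rate = line_rate_gbps
        self.current_rate = line_rate_gbps

    def on_cnp_received(self):
        # A CNP signals congestion: cut the rate (the factor is an assumption).
        self.current_rate = max(self.current_rate * 0.5, 0.1)

    def on_quiet_interval(self):
        # No CNPs for a while: gradually ramp back toward line rate.
        self.current_rate = min(self.current_rate * 1.1, self.line_rate)

sender = SenderRateControl(line_rate_gbps=100)
sender.on_cnp_received()     # many marked packets -> many CNPs -> sharp slowdown
print(sender.current_rate)   # 50.0
sender.on_quiet_interval()   # congestion clears -> rate creeps back up
print(sender.current_rate)   # 55.0 (approximately)
```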
Priority Flow Control
PFC is a mechanism that prevents packet loss when congestion happens. When PFC is enabled on the Cisco Nexus 9000 Series switch, a class of service (CoS) is reserved for lossless transport. Traffic in this class is treated differently than traffic in other classes. This approach enables flow control on a per-priority basis, simultaneously allowing lossless and lossy traffic priorities on the link. Any port on the Cisco Nexus 9000 Series switch that is configured with PFC is allocated a dedicated no-drop queue and a dedicated buffer for that queue.
Properly sized buffers are important for the optimal functioning of a lossless network. Buffer headroom is needed to accommodate packets already in flight and fluctuations in network traffic, such as microbursts. To provide lossless behavior, the no-drop queue has two thresholds. The XOFF threshold is set at a higher buffer utilization level; when it is crossed, a PFC pause frame is generated and sent toward the source of the traffic. The XON threshold is set at a lower buffer utilization level; when buffer utilization drops below it, pause frames are no longer sent toward the senders. At this point, the congestion has cleared.
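The XOFF/XON interplay can be summarized in a short sketch. The threshold values are placeholders, and a real switch generates standards-based PFC frames rather than strings; this only illustrates the hysteresis between the two thresholds.

```python
class NoDropQueue:
    """Toy model of a PFC no-drop queue with XOFF and XON thresholds."""

    def __init__(self, xoff_threshold, xon_threshold):
        assert xon_threshold < xoff_threshold   # XON sits at the lower level
        self.xoff_threshold = xoff_threshold
        self.xon_threshold = xon_threshold
        self.paused = False

    def update(self, buffer_used):
        if not self.paused and buffer_used >= self.xoff_threshold:
            self.paused = True
            return "send PFC pause frame toward the traffic source"
        if self.paused and buffer_used < self.xon_threshold:
            self.paused = False
            return "stop sending pause frames; congestion has cleared"
        return "no action"

queue = NoDropQueue(xoff_threshold=900, xon_threshold=300)
print(queue.update(950))   # crosses XOFF -> pause the sender
print(queue.update(250))   # drains below XON -> let traffic flow again
```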

On rare occasions, a PFC storm can occur when a malfunctioning host continuously transmits PFC frames. This behavior can cause the buffers on all the network nodes to fill up and, once it reaches all the end hosts, can bring the network to a complete stop. The PFC watchdog feature on Cisco Nexus 9000 Series Switches helps prevent PFC storms. A PFC watchdog interval is configured to detect whether packets in a no-drop queue are drained within a specified time period. If the period is exceeded and packets are not drained, all outgoing packets are dropped on interfaces that match the PFC queue that is not being drained. This approach prevents the continuous PFC frames from reaching the sender and causing a network stall.
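Conceptually, the watchdog simply checks whether a no-drop queue makes forward progress within the configured interval. The following sketch shows that idea only; it is not the NX-OS implementation, and the function and parameter names are hypothetical.

```python
import time

def pfc_queue_is_stuck(queue_depth_fn, watchdog_interval_s):
    """Return True if a no-drop queue fails to drain within the watchdog interval.

    queue_depth_fn is any callable that returns the current queue depth. If the
    queue is non-empty and has not drained at all after the interval, the switch
    would start dropping packets on the matching queue to break the stall.
    """
    depth_before = queue_depth_fn()
    time.sleep(watchdog_interval_s)
    depth_after = queue_depth_fn()
    return depth_before > 0 and depth_after >= depth_before

# Example: a queue whose depth never changes is flagged as stuck.
print(pfc_queue_is_stuck(lambda: 500, watchdog_interval_s=0.1))  # True
```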
Congestion Visibility in AI/ML Cluster Networks Using Cisco Nexus Dashboard Insights
Cisco Nexus 9000 Series Switches have the hardware and software capabilities to provide the latency and congestion management mechanisms that AI/ML applications require. They also offer rich monitoring and telemetry features that enable visibility into network performance. Cisco Nexus Dashboard Insights features and capabilities help you monitor and optimize congestion and performance.
Cisco Nexus Dashboard Insights
To help network administrators optimize their AI/ML network for best performance and predict issues before they become service-impacting, the network system must provide a deep level of visibility into the congestion management algorithms and the overall health of the network. The Cisco Nexus 9000 Series Switches come with powerful built-in telemetry capabilities that can be used to correlate issues in the network and help optimize it for Remote Direct Memory Access (RDMA) over Converged Ethernet (RoCE) version 2 (RoCEv2) transport.
The power of telemetry and insights allows you to monitor and optimize AI/ML fabrics in real time. With Cisco Nexus Dashboard Insights, you can monitor the following:
Fabric performance (latency, utilization, packet drops)
Lossless Ethernet statistics (ECN, PFC)
Congestion score
Latency score
You can also perform real-time troubleshooting.
The Cisco Nexus 9000 Series Switches provide hardware flow telemetry information using a flow table and flow table events. The flow table collects full flow information and metadata, such as the following:
5-tuple flow info
Interface and queue info
Flow start and stop times
Packet drop indicators
Burst measurement
With these features, every packet traversing the switch can be accounted for, observed, and correlated with behaviors such as microbursts or packet drops. You can export this data to Cisco Nexus Dashboard Insights and view it per device, per interface, and at the flow level.
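One way to picture an exported flow record is shown below. The field names are illustrative only and do not reflect the actual flow-table schema or export format.

```python
from dataclasses import dataclass

@dataclass
class FlowRecord:
    """Illustrative shape of a hardware flow-table record exported for analysis."""
    src_ip: str
    dst_ip: str
    protocol: int
    src_port: int
    dst_port: int           # with the four fields above, this forms the 5-tuple
    ingress_interface: str
    egress_queue: int
    flow_start_ns: int      # flow start and stop timestamps
    flow_end_ns: int
    drop_indicator: bool    # set when packets of this flow were dropped
    burst_bytes: int        # burst measurement for the flow
```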
Congestion and Performance Monitoring with Cisco Nexus Dashboard Insights
Cisco Nexus Dashboard Insights uses a set of advanced alerting, baselining, correlating, and forecasting algorithms to provide deep insights into the network's behavior by using telemetry data obtained from networking and compute components. Cisco Nexus Dashboard Insights automates troubleshooting and helps with rapid root-cause analysis and early remediation. It also simplifies audits and ensures compliance by using a unified network repository.
Cisco Nexus Dashboard Insights monitors lossless Ethernet fabric properties and provides ECN mark counters on a per-device and per-interface basis, and at a flow level. It can also report information about PFC packets that are issued or received by a switch on a per-CoS basis. With this information, a network administrator can observe real-time network congestion statistics and use them to tune the network to respond better to congestion.
With the granular visibility provided by Cisco Nexus Dashboard Insights, the network administrator can observe drops and tune WRED or AFD thresholds until drops stop occurring under normal traffic conditions. This step is the first and most crucial one in ensuring that the AI/ML network handles regular traffic congestion effectively. Along with tuning the thresholds, network administrators can enable the PFC feature to achieve completely lossless behavior. After drops are prevented, you can use ECN marking and PFC receive and transmit counter reports to tune the system for the best performance.
The Traffic Analytics feature enables you to monitor your network’s latency, congestion, and drops. Traffic Analytics automatically discovers services in your network by matching well-known Layer 4 ports to their corresponding service endpoint categories.
Cisco Nexus Dashboard Insights then assesses service performance based on the thresholds that you define for the following metrics:
Latency: This metric measures the overall time, in microseconds, that it takes a packet to travel from source to destination.
Congestion: This metric measures network bandwidth utilization and quality of service (QoS) activation mechanisms to determine if a service is experiencing network congestion.
Drops: This metric measures the percentage of dropped versus transmitted packets, considering factors such as cyclic redundancy check (CRC) errors, faulty cables, and faulty devices.
An anomaly is raised if performance metrics, such as latency, congestion, or drops, deviate from normal. The performance score is calculated for each conversation and aggregated to the Service Endpoint or Endpoint level to raise anomalies.
The performance score is calculated based on the following:
Congestion: This score reflects how consistently congestion is avoided between endpoints.
Latency: This score indicates deviation from the measured baseline.
Drops: This score indicates the number of packet drops.
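As a mental model only, the three components can be combined into a single score as in the sketch below. The weights and formula are illustrative assumptions and do not represent the actual Cisco Nexus Dashboard Insights scoring algorithm.

```python
def performance_score(latency_deviation_pct, congestion_events, drop_pct):
    """Toy aggregation of latency, congestion, and drop components into one score.

    Starts from 100 and subtracts capped penalties; all weights are illustrative.
    """
    score = 100.0
    score -= min(latency_deviation_pct, 50)    # deviation from the latency baseline
    score -= min(congestion_events * 2, 30)    # congestion signals between endpoints
    score -= min(drop_pct * 10, 50)            # packet drops weigh heaviest
    return max(score, 0.0)

# A conversation with mild latency deviation, a few congestion events, few drops.
print(performance_score(latency_deviation_pct=5, congestion_events=3, drop_pct=0.2))
```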
The Traffic Analytics feature allows you to accomplish the following tasks:
Monitor traffic extensively, including congestion and latency scores.
Report performance issues using anomalies raised for performance metrics.
Sort top active services and clients and determine the most active endpoints in the system.
Determine the sync (SYN flag in TCP header) or reset (RST flag in TCP header) counts per service.
Troubleshoot conversations or flows on demand.
Pipeline Considerations
AI/ML applications run in production as workflows consisting of multiple stages. Each stage has different resource and performance requirements for the data center infrastructure. Therefore, data center infrastructure capabilities and capacity significantly affect the performance of AI/ML workflows. Let's look at the interaction between AI/ML application requirements and data center infrastructure.
AI/ML Pipeline Overview
An AI/ML workflow has two main stages: training and inferencing. You first build, train, and fine-tune your model, and then deploy it on production hardware to make decisions or predictions on new data. You must also continuously evaluate and retrain your model.
Model training focuses on learning from huge data sets and adjusting parameters to minimize error. The training phase involves the following steps:
Data collection
Preprocessing
Feature engineering
Model selection
Training
Evaluation
Hyperparameter tuning
Final model training
Inferencing involves the following steps:
Model deployment
Data input
Prediction
Postprocessing
Evaluation
Monitoring
Establishing a feedback loop
Generative AI large language models (LLMs) require significant resources that are usually available only to hyperscalers, so you will typically not develop and train your own models from scratch. Instead, you will use predeveloped external models and fine-tune them for your specific use case.
Infrastructure Impact on AI/ML Performance
AI/ML clusters generally comprise many specialized nodes, with graphics processing units (GPUs) and other hardware accelerators bound together in a specialized network. The algorithms that run on these GPUs are computationally intensive and perform calculations across huge datasets, often larger than the memory available on a single GPU. The job is split across multiple GPUs to distribute the load, and the cluster performs an iterative set of calculations on the data set. Each GPU performs a smaller portion of the calculation and sends the results to all its peers in a transmission process that is known as the All-to-All collective. After the transmission, a barrier operation (synchronization) occurs, which stalls all the GPUs while waiting for all the data to be received.
This barrier operation makes the whole process extremely sensitive to network performance. If even one slow path exists in the network, all the GPUs will stall, waiting for that one transmission to complete. This situation is known as the tail latency of the job.
Note: The time that it takes from the start of the transmission to the time that all GPUs receive their results is the job completion time (JCT). The JCT is used as a critical measure of AI performance. Poor load balancing and packet drops can cause high tail latencies, resulting in poor JCT.
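Because of the barrier, a training step finishes only when the slowest transfer completes, so the JCT is governed by the tail latency. The sketch below illustrates this with hypothetical per-GPU transfer times.

```python
def job_completion_time(per_gpu_transfer_times):
    """With a barrier after the All-to-All collective, the step completes only
    when the slowest GPU-to-GPU transfer finishes, so JCT tracks tail latency."""
    return max(per_gpu_transfer_times)

# Seven fast transfers and one slow one: the single slow path dominates the JCT.
print(job_completion_time([1.1, 1.0, 1.2, 1.0, 1.1, 1.0, 1.2, 4.8]))  # -> 4.8
```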
In real life, AI/ML clusters run many simultaneous and independent jobs over the same network. As more jobs execute independently, the job-to-job interference increases. As network congestion increases, tail latency increases. This progression is a normal event in traditional networking. Unfortunately, in AI/ML networks, the synchronization component makes the impact of such tail latency significantly greater.
AI/ML Pipeline Impact on Infrastructure
Training is the most demanding stage in terms of the hardware resources needed. It requires a high-throughput environment for I/O and networking. It also requires high computing resources, often involving large networks with many GPU or CPU hosts. The process is typically longer and operates offline, usually on-premises. Training traffic patterns are bursty and require parallel processing across multiple GPU nodes.
Inference uses the trained model for predictions. It needs lower compute resources with smaller networks and mid-sized GPU or CPU hosts. Predictions are made quickly, which supports real-time online operations across on-premises, cloud, and edge devices. Therefore, inferencing is more latency-sensitive because it makes real-time decisions based on queries or user prompts.
| Aspect | Training | Inferencing |
| --- | --- | --- |
| Purpose | Model learning from data; adjusting parameters to minimize training-data error | Using the trained model for prediction |
| Compute and resources | High; large network with many GPU and CPU hosts | Low; smaller network with mid-sized GPU and CPU hosts |
| Duration | Longer duration | Quick predictions |
| Operation model | Offline; on-premises | Online, real time; on-premises, cloud, and edge devices |
| Traffic patterns | Bursty traffic | Consistent traffic |
| Scalability and network profiles | Parallel processing across GPU data nodes | Handling numerous requests simultaneously |
| Data center implications | Bandwidth, training time | High availability, latency |
The primary data center requirements to support the training phase are high-bandwidth capacity and network availability for longer time periods. The inferencing phase requires high availability and low latency to ensure efficient real-time processing. These differences highlight the characteristics of AI/ML workloads and infrastructure requirements and emphasize the importance of customized resource allocation and management strategies in data centers.

Note: Ranking models are AI/ML models used to order items according to user-defined criteria; common applications include search engines and recommendation systems. LLMs can understand and generate human-like text; common applications include language translation, text summarization, and more.