Building Lossless Fabric
Traditional QoS Toolset
The traditional Cisco QoS toolset includes various features and techniques to manage and prioritize network traffic to ensure optimal performance for critical applications. The following are the key mechanisms of the QoS toolset:
Classification and Marking
Congestion Management
Traffic Policing and Shaping
Without QoS, a device offers the best-effort service for each packet, regardless of the packet content or size. The device sends the packets without assuring reliability, latency, or throughput. With the QoS configuration, you can enable priority handling and reserve resources for specific application traffic. A proper end-to-end QoS configuration provides the following benefits:
Improved performance: Critical applications receive the necessary resources, which improves performance and reliability.
Enhanced user experience: Latency and packet loss are reduced for real-time applications such as voice and video.
Efficient bandwidth utilization: Network congestion is prevented, and available bandwidth is used more effectively.
Implementing QoS mechanisms and support for specific features depends on the platform and model. The following figure presents an implementation overview of supported QoS features for Cisco Nexus 9000 data center switches.
Classification based on: ACL, DSCP, CoS, and IP Precedence
Marking traffic with: DSCP, CoS, and IP Precedence
Policing: Ingress and Egress
Buffering and Queuing: Shared Egress Buffer, 8 Egress Queues
Scheduling: Strict Priority Queuing and DWRR
Shaping: Egress per Queue Shaper
Congestion Avoidance: Tail Drop and WRED with ECN
Note: The deficit weighted round robin (DWRR) scheduling algorithm ensures bandwidth distribution according to a weight parameter, which defines the portion of link bandwidth that is available to that queue.
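As a rough illustration of how DWRR shares link bandwidth by weight, the following Python sketch serves packets from several queues using a per-queue deficit counter that grows by a weight-proportional quantum each round. The queue names, weights, and packet sizes are hypothetical, not platform values.

```python
from collections import deque

def dwrr_schedule(queues, weights, quantum=1500, rounds=100):
    """Serve packets from queues using deficit weighted round robin.

    queues  -- dict: queue name -> deque of packet sizes (bytes)
    weights -- dict: queue name -> relative weight
    Each round a queue's deficit grows by weight * quantum; packets are
    sent while the head packet fits within the accumulated deficit.
    """
    deficit = {name: 0 for name in queues}
    sent = {name: 0 for name in queues}
    for _ in range(rounds):
        for name, q in queues.items():
            if not q:
                deficit[name] = 0          # empty queues carry no credit
                continue
            deficit[name] += weights[name] * quantum
            while q and q[0] <= deficit[name]:
                pkt = q.popleft()
                deficit[name] -= pkt
                sent[name] += pkt
    return sent

# Example: queue "bulk" receives roughly twice the bandwidth of "default".
queues = {"bulk": deque([1500] * 200), "default": deque([1500] * 200)}
print(dwrr_schedule(queues, {"bulk": 2, "default": 1}))
```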
Enhanced Transmission Selection
ETS is a bandwidth management technique from the Cisco Data Center Bridging (DCB) suite and is defined in the IEEE 802.1Qaz standard. ETS enables optimal bandwidth management across different virtual links: it prioritizes specific traffic classes while providing fair bandwidth allocation to all traffic classes, which prevents congestion and improves overall network efficiency. You can provide a guaranteed minimum bandwidth to certain traffic classes, such as storage or high-performance computing (HPC). When a traffic class does not fully use its allocated bandwidth, the remaining bandwidth is available to other classes, which helps accommodate classes with bursty traffic.
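The following Python sketch illustrates the ETS allocation idea described above: each traffic class is guaranteed a share of link bandwidth, and any bandwidth a class leaves unused is redistributed to classes that still have demand. The class names, guarantees, and offered loads are hypothetical, and the algorithm is only an approximation of the standard's behavior.

```python
def ets_allocate(link_bw, guarantees, demands):
    """Approximate ETS behavior for one link.

    guarantees -- dict: class -> guaranteed fraction of the link (sums to 1.0)
    demands    -- dict: class -> offered load in Gbps
    Bandwidth unused by a class is redistributed to classes that still
    have demand beyond their guarantee.
    """
    alloc = {c: min(demands[c], guarantees[c] * link_bw) for c in demands}
    spare = link_bw - sum(alloc.values())
    hungry = {c: demands[c] - alloc[c] for c in demands if demands[c] > alloc[c]}
    while spare > 1e-9 and hungry:
        share = spare / len(hungry)
        spare = 0.0
        for c in list(hungry):
            extra = min(share, hungry[c])
            alloc[c] += extra
            hungry[c] -= extra
            spare += share - extra          # give back what this class cannot use
            if hungry[c] <= 1e-9:
                del hungry[c]
    return alloc

# The storage class underuses its 50% guarantee, so the bursty HPC class
# can borrow the spare bandwidth: {'storage': 20, 'hpc': 60, 'default': 20}
print(ets_allocate(100, {"storage": 0.5, "hpc": 0.3, "default": 0.2},
                   {"storage": 20, "hpc": 70, "default": 20}))
```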

Intelligent Buffer Management on Cisco Nexus 9000 Series Switches
Modern data center traffic consists of short flows, often called mice flows, and long flows, often called elephant flows. In a queue, switches must treat mice and elephant flows differently. However, traditional data center switches with simple buffer management cannot differentiate mice and elephant flows. First in, first out (FIFO) queue scheduling does not prevent mice flows from experiencing long queuing latency during congestion or tail drops upon buffer overflow.

Random early detection (RED) and WRED are active queue-management mechanisms. They are used to avoid congestion by using proactive packet drops to trigger TCP congestion control on the sender, which then slows down the traffic transmission rate. However, RED and WRED randomly drop packets in a queue when the queue length exceeds the congestion threshold. This process can also affect mice flows that carry important control traffic, which leads to performance degradation.
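As a minimal sketch of how an active queue-management mechanism such as WRED decides whether to drop a packet, the following Python function ramps the drop probability linearly between a minimum and a maximum queue threshold. The threshold and probability values are illustrative, not platform defaults.

```python
import random

def wred_should_drop(avg_queue_len, min_th, max_th, max_p):
    """WRED drop decision for one traffic class.

    Below min_th nothing is dropped; between min_th and max_th the drop
    probability ramps linearly up to max_p; at or above max_th every
    packet is dropped (tail drop).
    """
    if avg_queue_len < min_th:
        return False
    if avg_queue_len >= max_th:
        return True
    drop_p = max_p * (avg_queue_len - min_th) / (max_th - min_th)
    return random.random() < drop_p

# Example: average queue depth of 800 KB with thresholds of 300/1200 KB.
print(wred_should_drop(800, min_th=300, max_th=1200, max_p=0.1))
```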
Note: The distribution of flow lengths in data center networks is such that a few elephant flows carry most of the data traffic. Mice flows add bursts to the overall traffic load. Elephant and mice flows perform different application jobs, and so have different requirements for bandwidth, packet loss, and network latency.
Traditional switches must buffer all packets at a congested link to avoid packet loss in mice flows, because they cannot selectively buffer mice flows. For this reason, some new switch platforms are built with deep buffers to help ensure ample buffer space. However, this approach has significant disadvantages:
The cost of a switch with greater buffer space is significantly higher.
The extended queue depth causes longer queuing latency. The latency will increase the flow completion time for all data flows, thus degrading overall application performance.
Note: TCP tries to consume the maximum available bandwidth on an end-to-end traffic path, which can consume all the buffers on the links. Unless ECN is enabled, TCP uses packet drops as implicit congestion notifications to trigger its congestion-control mechanism on the sender to reduce packet transmission. If no packet drops are detected, a TCP sender can continuously increase its transmit rate, which can eventually cause buffer overflow on the congested link. When this situation occurs, new mice flows arriving at the queue are subject to tail drop.
The Cisco Nexus 9000 Series Switches with Cisco cloud-scale ASICs are built with a moderate amount of on-chip buffer space to achieve maximum throughput, and with intelligent buffer management functions to handle both mice and elephant flows efficiently. Cisco intelligent buffer management is based on the ability to distinguish mice and elephant flows and to apply different queue management techniques to them, according to their network forwarding requirements, when a link is congested. This approach allows both elephant and mice flows to achieve the best performance, which improves overall application performance.

Note: The Cisco Nexus intelligent buffer management capabilities allow you to have smart and appropriately sized buffers.
The intelligent buffer management capabilities are built into Cisco cloud-scale ASICs for hardware-accelerated performance and include:
Approximate fair dropping (AFD) with elephant trap (ETRAP). AFD preserves buffer space to absorb mice flows, particularly microbursts and aggregated mice flows, by limiting the buffer use of bandwidth-intensive elephant flows. It also ensures bandwidth fairness among elephant flows. ETRAP is a technique that detects elephant flows.
Dynamic packet prioritization (DPP) allows the separation of mice and elephant flows into two different queues so that buffer space can be distributed independently, and different queue scheduling can be applied to them. For example, mice flows can be mapped to a low-latency queue, and elephant flows can be sent to a weighted fair queue.
Note: AFD and DPP can be deployed separately or together. They complement each other to deliver the best application performance. Also, a buffer admission algorithm called Dynamic Buffer Protection (DBP) ensures fair access to the available buffer for all queues.
AFD with ETRAP
AFD is an active queue-management scheme that ensures fair bandwidth allocation between flows.
Fairness has two aspects:
AFD preserves buffer space for handling mice flows by limiting the buffer use of bandwidth-intensive elephant flows. Elephant flows usually have a much longer session duration than mice flows.
AFD tracks elephant flows and applies the AFD algorithm in the egress queue to allocate a fair share of bandwidth to them.
AFD uses ETRAP to distinguish long-lived elephant flows from short-lived mice flows. Mice flows are excluded from the dropping algorithm so they receive a fair share of bandwidth, and bandwidth-intensive elephant flows do not consume all the bandwidth. This concept is illustrated in the following figure.

AFD with ETRAP uses these principles:
A flow may be defined using multiple parameters, but typically the five-tuple (source and destination IP address, source and destination port number, transport protocol) is used.
ETRAP operates on the ingress side of a switch. It measures the byte counts of incoming flows and compares these counts against the ETRAP threshold.
Flows with a byte count lower than the threshold are mice flows.
A flow with a byte count higher than the ETRAP threshold becomes an elephant flow and is moved to the elephant table for tracking.
You can configure the ETRAP threshold to define what an elephant flow is in your data center environment.
The elephant table stores and tracks elephant flow arrival rates and activity. The measured data rates are passed to the buffer management mechanism on egress queues, where the AFD algorithm uses them to calculate the drop probability for each flow. Elephant flows that do not remain sufficiently active are timed out: when an elephant flow's average bandwidth during the age period is lower than the configured bandwidth threshold, the flow is considered inactive and is removed from the elephant flow table.
Note: A user-configured age-period timer and a bandwidth threshold are used to evaluate the activeness of an elephant flow.
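The following Python sketch ties together the ETRAP principles above: flows are counted on ingress, promoted to an elephant table once their byte count crosses a threshold, and aged out when their measured rate falls below a bandwidth threshold during an age period. All threshold values and the class interface are hypothetical illustrations, not the ASIC implementation.

```python
class EtrapTracker:
    """Sketch of ETRAP flow tracking with illustrative thresholds."""

    def __init__(self, byte_threshold=1048576, bw_threshold=500_000, age_period=0.05):
        self.byte_threshold = byte_threshold   # bytes before promotion to elephant
        self.bw_threshold = bw_threshold       # bytes per age period to stay active
        self.age_period = age_period           # seconds between age evaluations
        self.candidates = {}                   # five-tuple -> ingress byte count
        self.elephants = {}                    # five-tuple -> bytes seen this period

    def on_packet(self, five_tuple, size):
        """Classify one packet and update the per-flow byte counters."""
        if five_tuple in self.elephants:
            self.elephants[five_tuple] += size
            return "elephant"
        self.candidates[five_tuple] = self.candidates.get(five_tuple, 0) + size
        if self.candidates[five_tuple] > self.byte_threshold:
            del self.candidates[five_tuple]
            self.elephants[five_tuple] = 0     # promote to the elephant table
            return "elephant"
        return "mice"

    def age_out(self):
        """Run once per age period: remove inactive elephants, reset counters."""
        for flow, seen in list(self.elephants.items()):
            if seen < self.bw_threshold:
                del self.elephants[flow]       # timed out, treated as mice again
            else:
                self.elephants[flow] = 0
```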
Before AFD, WRED was the primary technology used for the same or similar purposes. WRED applies weighted random early discard to class-based queues but does not have flow awareness within a class. WRED has the following disadvantages:
All packets, including packet-loss-sensitive mice flows, are subject to the same drop probability.
Although elephant flows can use drops as congestion signals to slow down the traffic sending rate, drops can negatively affect mice flows.
The same drop probability causes elephant flows with a higher rate (due to short round-trip time [RTT]) to get more bandwidth. Therefore, egress bandwidth is not evenly divided among elephant flows traversing the same congested link.
As a result, the flow completion time for mice flows increases, and elephant flows do not have fair access to the link bandwidth and buffer resources. In contrast, AFD considers the flow sizes and data arrival rates before making a drop decision. The dropping algorithm is designed to protect mice flows and provide fairness among elephant flows.
AFD does not distinguish between transport protocol types, such as TCP and UDP. Unlike TCP, UDP does not have a built-in congestion management algorithm. AFD with ECN marking will only achieve congestion avoidance if the UDP-based application is ECN-aware, which is usually not the case. Therefore, AFD should not be enabled on queues that carry non-TCP traffic.
Note: Because AFD is configurable per queue, it is better to classify traffic by protocol and ensure that traffic from high-bandwidth UDP-based applications always uses a non-AFD-enabled queue.
The purpose of the AFD technique: it preserves buffer space to absorb mice flows, particularly microbursts and aggregated mice flows, by limiting the buffer use of bandwidth-intensive elephant flows, and it ensures bandwidth fairness among elephant flows.
Dynamic Packet Prioritization
DPP queues mice flows and elephant flows in the same traffic class separately. It isolates them in two separate queues even though they belong to the same traffic class. This separation is impossible in traditional queue management technologies because they lack flow awareness within a class.
DPP tracks flows to distinguish mice flows from elephant flows on the ingress port and then applies separate queuing policies to mice and elephant flows on the egress port. DPP operates using the following principles:
On the ingress port, flows are identified based on their five-tuples, and their initial packet counters are monitored.
Packets in a flow are then classified into the mice QoS group or the original QoS group based on the flow size.
Within the DPP mechanism, a maximum-packet counter is used to determine the classification.
If the DPP maximum-packet counter is defined as N, the first N packets in every flow are classified into the mice QoS group.
Packets from N+1 onward are classified into the original QoS group of the traffic class to which the flow belongs.
Essentially, a new flow always starts as a mice flow until it sends more packets than the value of the DPP maximum-packet counter. The following figure illustrates how the DPP mechanism distinguishes mice from elephant flows.
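The following Python sketch illustrates the DPP classification rule described above: the first N packets of every flow are steered to the mice QoS group, and later packets keep the flow's original QoS group. The counter value and QoS group numbers are hypothetical.

```python
def dpp_classify(flow_packet_count, original_qos_group, max_packets=120,
                 mice_qos_group=7):
    """Return the QoS group for a packet under DPP (illustrative values).

    The first max_packets packets of a flow go to the mice QoS group;
    later packets keep the flow's original QoS group.
    """
    if flow_packet_count <= max_packets:
        return mice_qos_group
    return original_qos_group

# Per-flow packet counters keyed by five-tuple; a new flow always starts
# as a mice flow until it exceeds the DPP maximum-packet counter.
counters = {}

def classify_packet(five_tuple, original_qos_group):
    counters[five_tuple] = counters.get(five_tuple, 0) + 1
    return dpp_classify(counters[five_tuple], original_qos_group)
```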

After DPP differentiates mice and elephant flows into separate QoS groups on the ingress port, different queuing controls can be applied to them on the egress port to meet the different forwarding requirements for mice and elephant flows. The mice flow queue should be a priority queue, whereas the regular queue for elephant flows is a weighted queue with bandwidth that is allocated for these flows.
The purpose of the DPP technique: it distributes buffer space independently for mice and elephant flows by separating them into two different queues within the same traffic class, so that different queue scheduling can be applied to each.
Data Center Bridging Exchange
The Data Center Bridging Exchange (DCBX) protocol can discover and exchange priority and bandwidth information between endpoints. DCBX simplifies management by allowing configuration and distribution of parameters from one node to another, ensuring that both ends of a link have a consistent configuration.

DCBX is implemented as an extension of the Link Layer Discovery Protocol (LLDP) with new type, length, value (TLV) fields. The following parameters of data center Ethernet features can be exchanged and synchronized between the two nodes:
ETS
PFC
CoS values
Congestion notification
Logical link down
Note: TLV is a generic and flexible data encoding scheme that is used in communication protocols to extend specific protocol capabilities easily. It consists of three parts: type, length, and value.
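As a simple illustration of TLV encoding, the following Python sketch packs and parses generic TLVs with a 1-byte type and a 2-byte length. This is a generic layout for illustration only; the actual LLDP TLV header packs a 7-bit type and a 9-bit length into two bytes.

```python
import struct

def tlv_encode(tlv_type, value):
    """Encode one TLV: 1-byte type, 2-byte length, then the value bytes."""
    return struct.pack("!BH", tlv_type, len(value)) + value

def tlv_decode(data):
    """Decode a byte string that contains back-to-back TLVs."""
    tlvs, offset = [], 0
    while offset < len(data):
        tlv_type, length = struct.unpack_from("!BH", data, offset)
        offset += 3
        tlvs.append((tlv_type, data[offset:offset + length]))
        offset += length
    return tlvs

# Hypothetical type numbers and payloads, purely for illustration.
payload = tlv_encode(9, b"ETS config") + tlv_encode(11, b"PFC config")
print(tlv_decode(payload))   # [(9, b'ETS config'), (11, b'PFC config')]
```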
Lossless Ethernet Fabric using RoCEv2
For RDMA over Converged Ethernet version 2 (RoCEv2) transport, the network must provide high throughput and low latency while avoiding traffic drops in situations where congestion occurs. Because packet drops slow down the AI/ML application training, RoCEv2 requires lossless transport. The lossless network can be achieved using ECN and PFC congestion avoidance algorithms. Cisco Nexus 9000 Series Switches support PFC congestion management and ECN marking with WRED or AFD to indicate congestion in the network node.
Explicit Congestion Notification
In situations where congestion information needs to be propagated end-to-end, ECN can be used for congestion management. ECN is marked in the two least significant bits of the IP header type of service (ToS) field by the network node where congestion is experienced.
When a receiver gets a packet with the ECN bits set to 11 (Congestion Experienced), it generates and sends a congestion notification packet (CNP) back to the sender.
When the sender receives the congestion notification, the flow that matches the notification slows down. This end-to-end process is built into the data path and, as such, is an efficient way to manage congestion.
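The following Python sketch shows how the two ECN bits of the ToS byte can be manipulated: a congested node rewrites them to 11 (Congestion Experienced) while leaving the DSCP bits untouched, and the receiver checks them to decide whether to generate a CNP. The DSCP value used is only an example.

```python
# ECN uses the two least significant bits of the IP ToS/Traffic Class byte:
# 00 = Not-ECT, 01/10 = ECT (ECN-capable transport), 11 = CE (Congestion
# Experienced). The upper six bits carry the DSCP value.
ECN_MASK = 0b11
ECN_CE = 0b11

def mark_congestion(tos_byte):
    """Set the CE code point while keeping the DSCP bits untouched."""
    return (tos_byte & ~ECN_MASK) | ECN_CE

def is_congestion_experienced(tos_byte):
    return (tos_byte & ECN_MASK) == ECN_CE

tos = (26 << 2) | 0b10            # DSCP 26 (AF31) with ECT(0) set by the sender
tos = mark_congestion(tos)        # congested hop rewrites the ECN bits to 11
print(is_congestion_experienced(tos))   # True -> the receiver returns a CNP
```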
Priority Flow Control
Priority Flow Control (PFC) was introduced in Layer 2 networks as the primary mechanism to enable lossless Ethernet. Flow control was driven by the class of service (CoS) value in the Layer 2 frame, and congestion was signaled and managed with a pause mechanism based on per-priority pause frames. However, building scalable Layer 2 networks can be challenging for network administrators, so network designs have mostly evolved into Layer 3 routed fabrics.
To support traffic routing, PFC was adjusted to work with differentiated services code point (DSCP) priorities to signal congestion between routed hops in the network. Using Layer 3 marking enables traffic to maintain its classification marking across routers.
Because PFC frames use local link addressing, network devices can receive and perform pause signaling for both routed and switched traffic. PFC is transmitted hop by hop, from the point of congestion toward the traffic source. PFC is the primary tool used to manage congestion for RoCEv2 transport.
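The following Python sketch illustrates the per-priority pause behavior in simplified form: when the buffer used by a no-drop priority crosses an xoff threshold, a pause is signaled upstream for that priority only, and a resume is signaled once usage falls back below an xon threshold. The threshold model and function interface are hypothetical simplifications.

```python
def pfc_action(buffer_used, xoff_threshold, xon_threshold, paused):
    """Per-priority PFC decision with illustrative xoff/xon thresholds.

    When the ingress buffer used by a no-drop priority crosses xoff, the
    switch sends a pause frame for that priority toward the upstream hop;
    when usage falls back below xon, it signals resume. Other priorities
    on the same link keep forwarding.
    """
    if not paused and buffer_used >= xoff_threshold:
        return "send-pause"      # ask the upstream sender to stop this priority
    if paused and buffer_used <= xon_threshold:
        return "send-resume"     # buffer drained, this priority may flow again
    return "no-change"

print(pfc_action(buffer_used=1600, xoff_threshold=1500, xon_threshold=900,
                 paused=False))   # send-pause
```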
Build Lossless Ethernet Networks with ECN and PFC
Both ECN and PFC can manage congestion very well by themselves. Working together, they can be even more effective. ECN can react first to mitigate congestion. If ECN does not react fast enough and buffer utilization continues to increase, PFC behaves as a fail-safe and prevents traffic drops. This technique is the most efficient way to manage congestion and build lossless Ethernet networks. This collaborative process between PFC and ECN is called Data Center Quantized Congestion Notification (DCQCN) and was developed for RoCE networks.
Together, PFC and ECN provide efficient end-to-end congestion management:
When the system experiences minor congestion with moderate buffer usage, WRED with ECN manages the congestion seamlessly.
In cases where congestion is more severe or caused by microbursts that produce high buffer usage, PFC is triggered to manage the congestion.
Note: ECN and PFC must be configured end-to-end across the entire data center network.
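The following Python sketch illustrates how the ECN and PFC thresholds are typically staggered on a lossless queue: ECN marking begins at a lower buffer depth, and the PFC pause threshold sits above it as a fail-safe so packets are never tail-dropped. All threshold values are hypothetical.

```python
def lossless_queue_action(queue_depth, ecn_min, ecn_max, pfc_xoff):
    """Staggered DCQCN-style thresholds (illustrative, in KB of buffer).

    Assumes ecn_min < ecn_max < pfc_xoff: ECN marking reacts first, and
    PFC pause only fires if the buffer keeps filling.
    """
    actions = []
    if queue_depth >= ecn_max:
        actions.append("ecn-mark every packet")
    elif queue_depth >= ecn_min:
        actions.append("ecn-mark (probability grows with depth)")
    if queue_depth >= pfc_xoff:
        actions.append("pfc-pause (fail-safe, no drops)")
    return actions or ["forward"]

for depth in (100, 700, 1600):
    print(depth, lossless_queue_action(depth, ecn_min=300, ecn_max=1200,
                                       pfc_xoff=1500))
```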
Advanced Congestion Management with AFD
The recommended approach for RoCE networks is to use ECN and PFC together. You can also enhance congestion management by using advanced QoS algorithms such as AFD on Cisco Nexus 9000 Series Switches to distinguish high-bandwidth flows (elephant flows) from short-lived and low-bandwidth flows (mice flows).
AFD can be combined with ECN as an alternative to WRED. By default, AFD applies early packet drops to elephant flows based on the drop probability calculated by the AFD algorithm. The packet drops serve as implicit congestion notifications to the source hosts. If you want to avoid packet drops, or you are dealing with UDP traffic (which does not slow down in response to drops), AFD can work with ECN to send congestion notifications without dropping packets. This AFD and ECN collaboration works as follows:
Based on the marking probability calculated by the AFD algorithm, the switch can mark the ECN Congestion Experienced (CE) bit in packets.
This marking is done only for high-bandwidth flows. A different number of packets are marked with ECN based on the bandwidth that the flow uses.
After that, the packets will continue to be transmitted toward the destination host instead of being dropped.
The destination host will return this congestion indication to the sender by generating a CNP.
When the source host receives the CNP, it reduces its transmission window as it would after a packet drop.
This process allows AFD and ECN to provide the intended functions, including fair bandwidth allocation and network congestion mitigation, which optimizes performance for the lowest latency without dropping packets.
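The following Python sketch illustrates the AFD-with-ECN idea in simplified form: mice flows are never marked, while an elephant flow is marked with a probability that grows as its measured arrival rate exceeds the fair rate computed for the queue. The formula and rates are an illustrative approximation, not the ASIC algorithm.

```python
import random

def afd_ecn_mark(flow_rate, fair_rate, is_elephant):
    """Decide whether to ECN-mark a packet (illustrative AFD approximation).

    Mice flows are never marked. For an elephant flow, the marking
    probability grows with how far its measured arrival rate exceeds the
    fair rate, so heavier flows receive more CE marks and back off harder
    once the receiver returns CNPs.
    """
    if not is_elephant or flow_rate <= fair_rate:
        return False
    mark_p = min(1.0, 1.0 - fair_rate / flow_rate)
    return random.random() < mark_p

# A 40 Gbps elephant flow against a 10 Gbps fair share is marked on roughly
# 75 percent of its packets; a mice flow is never marked.
print(sum(afd_ecn_mark(40, 10, True) for _ in range(10000)) / 10000)
print(afd_ecn_mark(2, 10, False))
```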
WRED marks all traffic in a queue equally. The advantage of AFD over WRED is its ability to distinguish and slow down the flows that cause the most congestion. AFD is more granular and marks only the higher-bandwidth elephant flows while leaving the mice flows unmarked. This approach ensures that mice flows are not penalized and forced to slow down.
Note: In an AI cluster, it is best to let short-lived communications run to completion rather than allow a long data transfer, and any congestion it causes, to slow them down. Packet drops are still avoided, and many transactions complete faster because the system can identify elephant flows and slow down only those flows.