Compute Resources

Compute Hardware Overview

  • Enterprises across all industries are starting to see how powerful AI/ML can be. Data scientists use large data sets to train AI/ML models, and the trained models have a wide range of use cases across industries and applications.

  • To address these use cases, Cisco offers different platform options, including modular, rack, converged, hyperconverged, and edge solutions for inference and model training, as illustrated in the following figure.

  • These platforms include the following:

    • Cisco UCS C-Series Rack Server

    • Cisco UCS X-Series Servers

    • Cisco UCS Converged Infrastructure FlashStack

    • Cisco UCS Converged Infrastructure FlexPod

    • Cisco UCS Hyperconverged Infrastructure

  • Unified Management, enabled by Cisco Nexus Dashboard and Cisco Intersight, eliminates silos and delivers a consistent operational model across enterprise apps and AI from data center to edge. Cisco Intersight is a cloud-based management platform that uses analytics to deliver proactive automation and support. Combining Cisco Nexus Dashboard and Cisco Intersight can reduce costs and resolve issues more quickly.

  • In addition to that framework, virtualization enabled by OpenShift and Kubernetes provides infrastructure abstraction for AI management tools, such as PyTorch and NVIDIA AI. These tools support AI frameworks, such as NVIDIA NGC, which offers a collection of cloud services, including NVIDIA NeMo, BioNeMo, and Riva Studio for generative AI, and the NGC Private Registry for securely sharing proprietary AI software.

  • Generative AI, for example, enables text-to-image generation, realistic voice synthesis, and the creation of scientific materials. However, using the full potential of these models requires a robust and optimized infrastructure designed for such a use case.

Intel Xeon Scalable Processor Family Overview

  • Intel Xeon Scalable processors are designed to meet various computing needs, whether for empowering solid foundations for AI workloads and high-performance computing (HPC), supporting critical workloads at the edge, or building a secure cloud. They offer optimized performance, scale, and efficiency across various data center, edge, and workstation workloads.

  • The latest generation of Intel Xeon Scalable processors has built-in accelerators and featured technologies that help optimize workload-specific performance, accelerate AI capabilities, reduce data center latency, reduce data bottlenecks, and balance resource consumption. Intel Accelerator Engines are purpose-built integrated accelerators on Intel Xeon Scalable processors that deliver performance and power efficiency advantages across today’s fastest-growing workloads.

  • The integration of 5th-generation Intel Xeon processors with Cisco UCS X-Series ensures seamless pairing and provides a wide range of options and scalability in performance.

  • These processors deliver impressive performance-per-watt gains across all workloads, with higher performance and lower TCO for AI, databases, networking, storage, and HPC. They are software- and platform-compatible with the 4th-generation Intel Xeon processors, so you can minimize testing and validation when deploying new systems for AI and other workloads.

  • Some of the key features of 5th-generation Intel Xeon Scalable processors for running AI workloads include the following:

    • Built-in AI accelerators on every core.

    • Intel Advanced Matrix Extensions (Intel AMX) for enhanced deep-learning inference and training performance.

    • Intel AI software suite of optimized open-source frameworks and tools.

    • Out-of-the-box AI performance and end-to-end productivity with more than 300 validated deep-learning models.

    • Higher core count for faster compute processing with better scalability for training and inferencing parallel tasks.

Intel AMX

  • Intel AMX enables Intel Xeon Scalable processors to boost the performance of deep-learning training and inference workloads, balancing inference, which is the most prominent use case for a CPU in AI applications, with expanded training capabilities.

  • Primary benefits of Intel AMX include the following:

    • Improved performance

      • CPU-based acceleration can improve power and resource utilization efficiencies, giving you better performance for the same price.

    • Reduced TCO

      • Intel Xeon Scalable processors with Intel AMX enable a range of efficiency improvements that help reduce costs, reduce TCO, and advance sustainability goals.

      • Integrated Intel AMX accelerator removes the cost and complexity that are typically associated with adding a discrete accelerator.

    • Reduced development time

      • To simplify deep learning application development, Intel works closely with the open-source community, including the TensorFlow (Intel Extension for TensorFlow [ITEX]) and PyTorch (Intel Extension for PyTorch [IPEX]) projects, to optimize frameworks for Intel hardware and upstream Intel’s newest optimizations and features so they are immediately available to developers. This simplification enables you to take advantage of the Intel AMX performance benefits by adding a few lines of code and reducing overall development time.

  • As illustrated in the following figure, Intel AMX is used in frameworks and in data analytics at scale, in the form of AI libraries and optimizations for deep-learning frameworks. Data analytics at scale refers to the capacity to handle, process, and analyze massive amounts of data in a scalable, efficient, and cost-effective way.
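
  • Intel AMX is exposed through the CPU's instruction-set flags, so optimized frameworks pick it up automatically when it is present. As a quick sanity check (a minimal, Linux-only sketch that is not part of the source material), you can look for the AMX feature flags that the kernel reports:

```python
# Minimal, Linux-only check for Intel AMX support (assumes /proc/cpuinfo exists).
def has_amx() -> bool:
    try:
        with open("/proc/cpuinfo") as cpuinfo:
            flags = cpuinfo.read()
    except OSError:
        return False
    # Sapphire Rapids-class CPUs report amx_tile, amx_bf16, and amx_int8 flags.
    return "amx_tile" in flags

print("Intel AMX available:", has_amx())
```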

Intel Extension for PyTorch

  • Intel collaborated with the PyTorch project to develop IPEX, an open-source extension that optimizes deep-learning performance on Intel processors. Many of the optimizations will eventually be included in future PyTorch mainline releases, but the extension allows PyTorch users to get up-to-date features and optimizations more quickly. In addition to CPUs, IPEX will soon also support Intel GPUs. IPEX provides many optimizations that are specific to large language models (LLMs).
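
  • The following sketch shows the typical usage pattern: an existing model is passed through ipex.optimize() and then run under bfloat16 autocast so that Intel AMX can be used on supported CPUs. The model and input names are placeholders, not part of the source material.

```python
# Minimal IPEX inference sketch (placeholder model and input).
import torch
import torchvision.models as models
import intel_extension_for_pytorch as ipex

model = models.resnet50(weights=None).eval()
data = torch.rand(1, 3, 224, 224)

# Apply Intel-specific operator and graph optimizations for CPU inference.
model = ipex.optimize(model, dtype=torch.bfloat16)

# bfloat16 autocast lets oneDNN dispatch to Intel AMX on CPUs that support it.
with torch.no_grad(), torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    output = model(data)

print(output.shape)
```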

DeepSpeed

  • DeepSpeed is a deep learning optimization software package for scaling and speeding up deep-learning training and inference. DeepSpeed inference refers to the feature set implemented to speed up the inference of transformer models. It initially supported only Compute Unified Device Architecture (CUDA; NVIDIA) GPUs. Support for CPUs, specifically 4th-generation Intel Xeon Scalable processors, was recently added. The currently implemented features include automatic tensor parallelism (AutoTP), bfloat16 and int8 datatype support, and binding cores to rank.

  • The primary benefits of DeepSpeed are as follows:

    • Supports the use of CPUs as accelerators.

    • Can run on CPUs without changing the model code.

    • LLM code can be written without hardware-specific code.

    • Can use the CPU as an accelerator if IPEX is installed in the environment.

    • Highly optimized for CPU inference and training.
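
  • A minimal sketch of how DeepSpeed inference is typically initialized for CPU-based AutoTP with bfloat16 follows; the model name is a placeholder, and the script assumes it is launched with the DeepSpeed launcher (which handles binding cores to rank).

```python
# Minimal DeepSpeed AutoTP inference sketch (placeholder Hugging Face model).
import os
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-1.3b"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Shard the model across ranks with automatic tensor parallelism (AutoTP).
world_size = int(os.environ.get("WORLD_SIZE", "1"))
engine = deepspeed.init_inference(
    model,
    tensor_parallel={"tp_size": world_size},
    dtype=torch.bfloat16,
    replace_with_kernel_inject=False,  # AutoTP path, no CUDA-specific kernels
)

inputs = tokenizer("Deep learning on CPUs", return_tensors="pt")
outputs = engine.module.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```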

Cisco UCS C-Series Rack Servers

  • Cisco UCS C-Series Rack Servers deliver unified computing in a rack-mount form factor. The Cisco UCS C-Series Rack Server family offers an entry point into unified computing and provides the flexibility of standalone management or integration into a Cisco UCS-managed environment.

  • The following are some of the Cisco UCS C-Series Rack Servers that are suitable for AI/ML:

    • Cisco UCS C220 M7 Rack Server

    • Cisco UCS C240 M7 Rack Server

    • Cisco UCS C245 M8 Rack Server

  • The Cisco UCS C-Series offers several models to address various workload challenges through a balance of processing, memory, I/O, and internal storage resources. When used with Cisco UCS Manager, Cisco UCS C-Series Servers bring the power and automation of unified computing, including Cisco SingleConnect technology, to enterprise applications. Cisco SingleConnect unifies local area network (LAN), storage area network (SAN), and systems management into one simplified link for rack servers, blade servers, and virtual machines (VMs). This technology reduces the number of network adapters, cables, and switches needed and radically simplifies the network, reducing complexity.

  • Cisco UCS C-Series Servers are ready for Cisco Intersight, a cloud-based management platform that uses analytics to deliver proactive automation and support. Combining intelligence with automated actions can dramatically reduce costs and resolve issues faster.

Cisco UCS C220 M7 Server

  • The Cisco UCS C220 M7 Rack Server delivers superior performance and flexibility compared to its predecessor, and incorporates 4th- and 5th-generation Intel Xeon Scalable processors with up to 60 cores for each socket. Cisco UCS C220 M7 supports the latest Double Data Rate 5 (DDR5) DIMMs, allowing faster memory speeds and greater capacity. Also, it introduces Peripheral Component Interconnect Express (PCIe) Generation 5.0, which offers significantly higher I/O bandwidth than PCIe Generation 4.0, making it ideal for the most demanding workloads.

  • The Intel AMX built-in accelerator improves the performance of deep-learning training and inference on the CPU and is ideal for workloads like natural-language processing, recommendation systems, and image recognition. The Intel Data Streaming Accelerator (DSA) and In-Memory Analytics Accelerator (IAA) further enhance the capabilities of Cisco UCS C220 M7, and ensure it can easily handle next-generation applications.

  • The following Cisco UCS C220 M7 Rack Server features are important for AI and ML workloads:

    • Supports up to two 4th- or 5th-generation Intel Xeon Scalable CPUs, with up to 60 cores for each socket.

    • Allows up to 4 TB of DDR5 memory.

    • Supports up to three GPUs.

    • Includes a modular LAN on motherboard (mLOM) slot that can be used to install a Cisco UCS Virtual Interface Card (VIC) Series adapter without consuming a PCIe slot, and supports quad port 10-, 25-, and 50-Gbps or dual port 40-, 100-, and 200-Gbps network connectivity.

Cisco UCS C240 M7 Server

  • The Cisco UCS C240 M7 Rack Server offers cutting-edge performance and expandability, making it an excellent choice for various storage and I/O-intensive applications, including big data analytics, databases, collaboration, virtualization, consolidation, and HPC. This server, in its two-socket, two rack unit (2RU) form factor, builds on the strengths of the Cisco UCS C240 M6 Rack Server by incorporating advanced technologies and improved capabilities.

  • The following Cisco UCS C240 M7 Server features are important for AI and ML workloads:

    • Supports up to two 4th- or 5th-generation Intel Xeon Scalable CPUs with up to 64 cores for each socket.

    • Allows up to 8 TB of DDR5 memory.

    • Supports up to eight GPUs.

    • Includes an mLOM slot that can be used to install a Cisco UCS VIC Series adapter without consuming a PCIe slot, and supports quad port 10-, 25-, and 50-Gbps or dual port 40-, 100-, and 200-Gbps network connectivity.

Cisco UCS C245 M8 Server

  • The Cisco UCS C245 M8 Rack Server is suited for a wide range of storage and I/O-intensive applications, such as big data analytics, databases, collaboration, virtualization, consolidation, AI/ML, and HPC. It supports up to two AMD CPUs in a 2RU form factor.

  • It is powered by 4th-generation AMD EPYC processors, with more cores for each socket than previous-generation designs. Combined with advanced features like AMD Infinity Guard, this gives compute-intensive applications significantly better performance and cost efficiency.

  • The following Cisco UCS C245 M8 Server features are important for AI and ML workloads:

    • Support for up to two 4th-generation AMD EPYC CPUs in a server that is designed to drive as many as 256 CPU cores (128 cores for each socket).

    • Allows up to 6 TB of DDR5 memory.

    • Supports up to eight GPUs.

    • Includes an mLOM slot that can be used to install a Cisco UCS VIC Series adapter without consuming a PCIe slot, and supports quad-port 10-, 25-, and 50-Gbps or dual-port 40-, 100-, and 200-Gbps network connectivity.

Cisco UCS X-Series Modular System

  • The Cisco UCS X-Series Modular System streamlines the data center, accommodates the dynamic demands of modern applications, and supports traditional scale-out and enterprise workloads. Consolidating server types enhances operational efficiency and agility, and reduces complexity. Powered by the Cisco Intersight cloud operations platform, it shifts focus from administrative details to business outcomes through a hybrid cloud infrastructure that is assembled from the cloud, tailored to your workloads, and continually optimized.

  • The following are some of the Cisco UCS X-Series nodes that are suitable for AI/ML:

    • Cisco UCS X210c M7 Compute Node

    • Cisco UCS X410c Compute Node

    • Cisco UCS X440p PCIe Node (suitable for GPU accelerators)

  • Cisco UCS X-Series compute nodes are added into a Cisco UCS X-Series modular chassis to provide the best of blade and rack designs.

Cisco UCS X-Series X9508 Chassis

  • The Cisco UCS X-Series chassis is engineered to be adaptable and flexible. The following figure shows that the Cisco UCS X9508 chassis has only a power distribution midplane.

  • This innovative design provides fewer obstructions for better airflow. For I/O connectivity, vertically oriented compute nodes intersect with horizontally oriented fabric modules, allowing the chassis to support future fabric innovations. The superior packaging of the Cisco UCS X9508 chassis enables larger compute nodes and provides more space for actual compute components, such as memory, GPU, drives, and accelerators. Improved airflow through the chassis enables support for higher-power components, and more space allows future thermal solutions (such as liquid cooling) without limitations.

  • The Cisco UCS X9508 7RU chassis has the following characteristics:

    • Eight flexible slots: You can house a combination of compute nodes and a pool of future I/O resources that may include GPU accelerators, disk storage, and nonvolatile memory.

    • Two Intelligent Fabric Modules (IFMs): These modules connect the chassis to upstream Cisco UCS 6500 Series Fabric Interconnects, enabling a lossless and deterministic converged fabric to connect all blades and chassis together.

    • Two slots for Cisco UCS X-Series fabric modules: These slots can flexibly connect the compute nodes with I/O devices.

    • Six power supply units (PSUs): Each PSU provides 2800 W of power.

Cisco UCS X210c M7 Compute Node

  • The Cisco UCS X210c M7 Compute Node is the second generation of compute nodes to be integrated in the Cisco UCS X-Series Modular System. It delivers performance, flexibility, and optimization for deployments in data centers, cloud, and remote sites. This enterprise-class server offers market-leading performance, versatility, and density without compromising workloads. Up to eight compute nodes can reside in the 7RU Cisco UCS X9508 Server Chassis, offering one of the highest densities of compute, I/O, and storage.

  • The following Cisco UCS X210c M7 features are important for AI and ML workloads:

    • CPU: Up to two 4th-generation Intel Xeon Scalable Processors with up to 60 cores for each processor.

    • Memory: Up to 8 TB of DDR5 main memory.

    • Optional front mezzanine GPU module: The Cisco UCS front mezzanine GPU module is a passive PCIe Generation 4.0 front mezzanine option with support for up to two U.2 Non-Volatile Memory Express (NVMe) drives and two half-height, half-length (HHHL) GPUs.

    • Cisco UCS PCI Mezz card for Cisco UCS X-Fabric: This card can occupy the server's mezzanine slot at the bottom rear of the chassis. Its I/O connectors link to Cisco UCS X-Fabric modules and enable connectivity to the Cisco UCS X440p PCIe Node.

    • Cisco UCS VIC 15000 Series adapters: These adapters use the mLOM slot to enable 100-Gbps connectivity.

Cisco UCS X410c M7 Compute Node

  • Cisco UCS X410c M7 Compute Node delivers the best performance, versatility, and density for a wide range of mission-critical enterprise applications, memory-intensive applications, and bare-metal and virtualized workloads. Powered by the Cisco Intersight cloud operations platform, it is an adaptable system that can support enterprise workloads and provide a consolidation platform. It is a system that is engineered for the future so you can embrace emerging technologies while reducing the risk of obsolescence.

  • The Cisco UCS X410c M7 Compute Node is the first four-socket compute node based on 4th-generation Intel Xeon Scalable processors to be integrated in the Cisco UCS X-Series Modular System. Up to four compute nodes, or two compute nodes and two GPU nodes, can reside in the 7RU Cisco UCS X9508 Server Chassis to offer high performance and efficiency gains for a wide range of mission-critical enterprise applications, memory-intensive applications, and bare-metal and virtualized workloads.

  • The following Cisco UCS X410c M7 features are important for AI and ML workloads:

    • CPU: Four 4th-generation Intel Xeon Scalable Processors with up to 60 cores for each processor.

    • Memory: Up to 16 TB of DDR5 main memory.

    • Storage: Up to six hot-pluggable solid-state drives (SSDs), or NVMe 2.5-inch drives with a choice of enterprise-class Redundant Array of Independent Disks (RAID) or pass-through controllers, and up to two M.2 Serial Advanced Technology Attachment (SATA) or NVMe drives with an optional hardware RAID.

    • Cisco UCS PCI Mezz card for Cisco UCS X-Fabric: This card can occupy the server's mezzanine slot at the bottom rear of the chassis. Its I/O connectors link to Cisco UCS X-Fabric modules and enable connectivity to the Cisco UCS X440p PCIe Node.

    • Cisco UCS VIC 15000 Series adapters: These adapters use the mLOM slot to enable 100-Gbps connectivity.

Cisco UCS X440p PCIe Node

  • The Cisco UCS X440p Gen4 PCIe Node is a new node type that is compatible with the Cisco UCS X9508 chassis. It can be connected to Cisco UCS X210c and X410c compute nodes in the Cisco UCS X9508 chassis to enable GPU accelerator support using Cisco UCS 9416 X-Fabric modules.

  • The Cisco UCS X440p PCIe Node enables integration of a PCIe resource node into the Cisco UCS X-Series Modular System. Up to four PCIe nodes can be housed in the 7RU Cisco UCS X9508 Chassis, each paired with a compute node. This arrangement offers up to four GPUs for a Cisco UCS X210c and X410c Compute Node with Cisco UCS X-Fabric technology.

  • The following GPU options are supported on the Cisco UCS X440p PCIe Node:

| GPU Product ID (PID) | PID Description | Maximum Number of GPUs for Each Node |
|---|---|---|
| UCSX-GPU-A16-D | NVIDIA A16 PCIE 250W 4X16 GB | 2 |
| UCSX-GPU-A40-D | TESLA A40 RTX, PASSIVE, 300W, 48 GB | 2 |
| UCSX-GPU-A100-80-D | TESLA A100, PASSIVE, 300W, 80 GB | 2 |
| UCSX-GPU-H100-80 | TESLA H100, PASSIVE, 350W, 80 GB | 2 |
| UCSX-GPU-L4 | NVIDIA L4 Tensor Core, 70W, 24 GB | 4 |
| UCSX-GPU-L40 | NVIDIA L40 300W, 48 GB wPWR CBL | 2 |

Cisco UCS 15000 Series Virtual Interface Cards

  • The Cisco UCS VIC 15000 Series is designed for Cisco UCS X-Series M6 and M7 Blade Servers, Cisco UCS B-Series M6 Blade Servers, and Cisco UCS C-Series M6 and M7 Rack Servers. The adapters can support 10-, 25-, 40-, 50-, 100-, and 200-Gigabit Ethernet and Fibre Channel over Ethernet (FCoE). They incorporate the Cisco next-generation converged network adapter (CNA) technology and offer a comprehensive feature set that provides investment protection for future feature software releases. They enable a policy-based, stateless, agile server infrastructure that can present PCIe standards-compliant interfaces to the host. These interfaces can be dynamically configured as either NICs or host bus adapters (HBAs). Some adapters also incorporate secure boot technology.

  • When a Cisco UCS rack server with a VIC 15000 Series adapter is connected to a fabric interconnect (Cisco UCS 6536 or 6300 Series Fabric Interconnects), the VIC is provisioned using Cisco Intersight Managed Mode (IMM) or Cisco UCS Manager policies. When a Cisco UCS rack server with VIC is connected to a top-of-rack (ToR) switch, such as the Cisco Nexus 9000 Series, the VIC is provisioned through the Cisco Integrated Management Controller (IMC) or Cisco Intersight policies for a Cisco UCS standalone server.

GPU Sharing

  • GPU sharing allows multiple users to share physical GPU resources. Two common approaches, GPU virtualization and Multi-Instance GPU (MIG), are used when GPU sharing is needed. Both are common in HPC, and especially in AI and ML solutions, where hardware costs are very high. These environments therefore demand efficient hardware utilization together with high computational power, high memory bandwidth, low latency, and as much CPU offload as possible. Satisfying these requirements can significantly improve TCO and system performance.

  • Many HPC, AI, and ML tasks do not require the full performance and resources of a high-end GPU. The resulting underutilization leads to inefficiencies, increased costs, and unpredictable performance. Addressing these challenges requires flexible allocation of GPU and CPU resources, which is what solutions like NVIDIA MIG, GPU virtualization, and NVIDIA GPUDirect technologies provide.

GPU Virtualization

  • GPU virtualization allows multiple VMs or containers to share the physical GPU resources.

  • There are two options for achieving GPU virtualization:

    • NVIDIA virtual GPU (vGPU) software-enabled solution: This option partitions a single GPU into multiple virtual GPUs and enables GPU resource sharing across multiple VMs.

    • NVIDIA MIG: This option allows a single GPU to be partitioned into up to seven independent instances with dedicated resources.

  • The following table summarizes the key differences between these two approaches:

| | NVIDIA MIG | NVIDIA vGPU |
|---|---|---|
| GPU Partitioning | Spatial (hardware) | Temporal (software) |
| Maximum Number of Partitions | 7 | 10 |
| Compute Resources | Dedicated | Shared |
| Compute Instance Partitioning | Yes | No |
| Address Space Isolation | Yes | Yes |
| Fault Tolerance | Yes (highest quality) | Yes |
| Low-Latency Response | Yes (highest quality) | Yes |
| NVIDIA NVLink Support | No | Yes |
| Multitenant | Yes | Yes |
| NVIDIA GPUDirect RDMA | Yes (GPU instances) | Yes |

  • Note: Temporal (software) GPU partitioning, or time-sharing, provides a way to share access to a GPU. It provides software-level isolation between the workloads in terms of address space isolation, performance isolation, and error isolation. Spatial (hardware) GPU partitioning provides hardware isolation between the workloads, plus consistent and predictable quality of service (QoS) for all instances running on the GPU.

  • NVIDIA NVLink is a high-speed connection for GPUs and CPUs, formed by a robust software protocol that typically rides on multiple pairs of wires printed on a computer board. It lets processors send and receive data from shared memory pools at very high speed. In its fourth generation, NVLink connects host and accelerated processors at rates up to 900 GB/s, which is more than seven times the bandwidth of PCIe Gen 5, the interconnect used in conventional x86 servers.

NVIDIA vGPU Software

  • Software makes a vGPU work. The NVIDIA vGPU software is installed at the virtualization layer along with the hypervisor. This software creates virtual GPUs that enable every VM to share a physical GPU installed on the server or allocate multiple GPUs to a single VM to power the most demanding workloads. The NVIDIA virtualization software includes a driver for every VM. Because work that was typically done by the CPU is offloaded to the GPU, the user has a much better experience, and demanding engineering and creative applications can be supported in a virtualized and cloud environment.

  • The relevant layers are shown in the following figure.

  • NVIDIA vGPU software uses temporal partitioning and has I/O memory management unit (IOMMU) protection for VMs that are configured with vGPUs. NVIDIA vGPU provides access to shared resources and the GPU's execution engines: graphics, compute, and copy engines. A GPU hardware scheduler is used when VMs share GPU resources. This scheduler uses time slicing to impose limits on GPU processing cycles. This scheduling dynamically harvests empty GPU cycles and allows for efficient use of GPU resources.

  • NVIDIA vGPU software can partition an NVIDIA A100, for example, into up to 10 vGPUs. Thus, 10 VMs can access this shared resource (40 GB of GPU memory), with 4 GB of GPU memory allocated per VM. A vGPU is assigned to a VM using a vGPU profile. The NVIDIA Compute Driver or NVIDIA Graphics Driver is then used to enable GPU functionality in apps and VMs.

  • The NVIDIA vGPU software consists of the following products:

    • NVIDIA RTX Virtual Workstation (vWS)

    • NVIDIA Virtual Compute Server (vCS)

    • NVIDIA Virtual PC (vPC)

    • NVIDIA Virtual Applications (vApps)

  • NVIDIA RTX vWS is ideal for compute-intensive workloads, such as accelerated graphics, NVIDIA Omniverse, and AI development. It is suitable for high-end designers who use powerful 3D content creation applications such as Dassault CATIA, SOLIDWORKS, 3DExcite, Autodesk Maya, and others. NVIDIA vWS allows users to access their applications with full features and performance, anywhere, on any device.

  • NVIDIA vCS enables the benefits of hypervisor-based server virtualization for GPU-accelerated servers. Data center admins are now able to power any compute-intensive workload with GPUs in a VM. vCS software virtualizes NVIDIA GPUs to accelerate large workloads, including more than 600 GPU-accelerated applications for AI, deep learning, and HPC. With GPU sharing, a single GPU can power multiple VMs to maximize utilization and affordability, or multiple virtual GPUs can power a single VM to make even the most intensive workloads possible.

  • NVIDIA vPC is ideal for users who want a virtual desktop but need a great user experience with PC Windows applications, browsers, and high-definition video. NVIDIA vPC delivers a native experience to users in a virtual environment, allowing them to run all their PC applications at full performance.

  • NVIDIA vApps is suitable for organizations deploying Citrix Virtual Apps and Desktops, Microsoft Remote Desktop Session Hosts (RDSH), or other app streaming or session-based solutions. Designed to deliver PC Windows applications at full performance, NVIDIA vApps allows users to access any Windows application at full performance on any device, anywhere.

NVIDIA MIG

  • In 2020, NVIDIA launched the Multi-Instance GPU (MIG) feature. It allows a single GPU to be partitioned into multiple GPU instances, each with its own dedicated resources like GPU memory, compute, and cache. MIG provides strict isolation, which allows multiple workloads to run on a single GPU without interfering with each other, and ensures predictable performance. For cloud service providers with multitenant use cases, MIG ensures that one client cannot impact the work or scheduling of other clients, and provides enhanced customer isolation.

  • As shown in the following figure, with MIG, each GPU instance has a separate and isolated path through the entire memory system: the on-chip crossbar ports, Layer 2 cache banks, streaming multiprocessors (SM in the figure), memory controllers, and DRAM address busses, which are all assigned to an individual instance. This approach ensures that a user's workload can run with known throughput and latency, with the same Layer 2 cache allocation and DRAM bandwidth.

  • MIG can partition available GPU compute resources to provide a defined QoS with fault isolation for clients such as VMs, containers, or processes. MIG enables multiple GPU instances to run in parallel on a single, physical NVIDIA GPU.
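
  • The sketch below (an illustrative example, not part of the source material) uses the NVIDIA Management Library Python bindings to check whether MIG mode is enabled on GPU 0 and to list the MIG instances with their dedicated memory; it assumes the nvidia-ml-py package and an NVIDIA driver are installed.

```python
# List MIG instances on GPU 0 via NVML (assumes nvidia-ml-py and an NVIDIA driver).
import pynvml

pynvml.nvmlInit()
gpu = pynvml.nvmlDeviceGetHandleByIndex(0)

current_mode, _pending_mode = pynvml.nvmlDeviceGetMigMode(gpu)
print("MIG enabled:", current_mode == pynvml.NVML_DEVICE_MIG_ENABLE)

# Walk the possible MIG device slots and report each instance's memory.
for index in range(pynvml.nvmlDeviceGetMaxMigDeviceCount(gpu)):
    try:
        mig = pynvml.nvmlDeviceGetMigDeviceHandleByIndex(gpu, index)
    except pynvml.NVMLError:
        continue  # no MIG instance at this slot
    memory = pynvml.nvmlDeviceGetMemoryInfo(mig)
    print(f"MIG instance {index}: {memory.total // (1024 ** 2)} MiB dedicated memory")

pynvml.nvmlShutdown()
```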

  • MIG supports the following deployment configurations:

    • Bare-metal, including containers.

    • GPU pass-through virtualization to Linux guests on top of supported hypervisors.

    • vGPU on top of supported hypervisors.

  • MIG offers the following benefits:

    • Resource efficiency: GPU utilization is maximized by allowing multiple workloads to share a single GPU.

    • Predictable performance: Each GPU instance operates in isolation, which ensures consistent performance for each workload.

    • Flexibility: MIG can be configured and dynamically reconfigured to create GPU instances of various sizes to match workload requirements.

    • Cost-efficiency: MIG can lead to cost savings as businesses get more out of their existing GPU infrastructure.

    • Enhanced security: Each MIG partition has its own dedicated memory and compute cores. This approach ensures that different workloads do not interfere with each other and reduces the attack surface.

Compute Resources Sharing

  • Efficient deployment of compute resources involves several strategies that are designed to maximize performance, minimize costs, and optimize energy use. It is very important to minimize TCO and environmental impact through efficient operations. In this topic, you will learn about different strategies for efficiently deploying compute resources to multiple developers. You can use solutions such as VMs, containerization, infrastructure as code (IaC), and cloud services to effectively deploy the needed compute resources.

Virtual Machines

  • Virtualization is an abstraction layer that decouples the physical hardware from the operating system to deliver greater IT resource utilization and flexibility. It allows multiple VMs with heterogeneous operating systems and applications to run in isolation, side-by-side, on the same physical machine.

  • A VM is the representation of a physical machine by software. It has its own set of virtual hardware (for example, RAM, CPU, NIC, hard disks, and so on) upon which an operating system and applications are loaded. The operating system sees a consistent, normalized set of hardware regardless of the actual physical hardware components. This hardware arrangement is provided by a hypervisor, as illustrated in the following figure.

  • VMs are intended to increase server utilization by running more applications per server. Consolidating underutilized servers helps organizations improve operational efficiency and reduce costs.

  • The following are common features of VMs:

    • Self-contained systems: A single host can simultaneously run multiple VMs and multiple operating system environments.

    • Protection from unstable applications: The VM is isolated from other VMs.

    • Less maintenance: Individual VMs can be expanded, shrunk, and moved without impacting other VMs.

    • Fast deployment: VMs can be created and deployed quickly, reproduced at scale, and moved to other hosts.

  • VMs are part of modern DevOps practices that emphasize containerization, automation, software-defined resources, and rapid development cycles. VMware offers multiple products built around its VMware vSphere ESXi hypervisor; VMware vSphere and VMware vCenter extend the features and functions that you can use to deploy a data center. Hyper-V is the native Microsoft hypervisor and is managed from the CLI, Microsoft Management Console, or System Center Virtual Machine Manager.

Containerization

  • A container engine is an operating system-level virtualization platform that allows developers to create, manage, and deploy lightweight, portable containers. Containers encapsulate an application and its dependencies and ensure consistency across various environments. This technology simplifies development by providing isolated environments on a shared operating system, which improves efficiency and scalability, as illustrated in the following figure.

  • Containers abstract an application from the operating system and the infrastructure that it runs on. To explain how containerization achieves this abstraction, it is important to understand these main concepts and components:

    • The container image contains all the information for the container to run, such as application code, day-zero configuration, and other dependencies.

    • The container engine pulls the container images from a repository and runs them.

    • The container is a running instance of a container image that the container engine has started.

  • Containers, managed with tools like Docker and Kubernetes, allow you to package applications and their dependencies in a consistent environment. This approach ensures that developers work in identical environments and simplifies resource management.
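
  • As a small illustration (assuming a local Docker daemon and the Docker SDK for Python, which are not part of the source material), the following sketch starts a short-lived container from a public image and captures its output:

```python
# Run a short-lived container with the Docker SDK for Python (assumes a local daemon).
import docker

client = docker.from_env()

output = client.containers.run(
    "python:3.11-slim",  # container image, pulled from the registry if not cached
    ["python", "-c", "print('hello from an isolated environment')"],
    remove=True,         # delete the container after it exits
)
print(output.decode())
```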

  • Managing thousands of containers is a challenge. Container orchestrators are software platforms that address this challenge by automating the lifecycle management of containers, including these tasks:

    • Provisioning and deployment

    • Resource allocation

    • Uptime and application health

    • Dynamic scaling up and down

    • Service discovery

    • Networking, security, and storage

  • The most popular container orchestration platform is Kubernetes. Kubernetes can be deployed on top of almost any infrastructure, on-premises, at the edge, or on public clouds.

  • One advanced end-to-end solution example is Run:ai, an AI orchestration platform that offers effective solutions for managing and streamlining AI workflows. When integrated with OpenShift on Cisco UCS X-Series, Run:ai can help optimize AI and ML workloads. OpenShift, a Kubernetes-based platform, provides the perfect environment for deploying and managing Run:ai to enable containerization and automation of AI workloads.

Infrastructure as Code

  • IaC is a method of defining and provisioning infrastructure using definition files containing code. IaC enables IT and development teams to automate and scale the provisioning and management of IT resources that are aligned with application source-code releases. It also allows you to automate the deployment of resources consistently and reliably.

  • The process is illustrated in the following figure.

  • Because repeatable, reusable code enables advanced automation, engineering teams can set up infrastructure simply by running a script. This process can be used at all stages of the application development lifecycle (including design, testing, and production), resulting in greater efficiency, better alignment between application and infrastructure, and faster development timelines.

  • Immutable infrastructure gives application development teams more confidence to test and run their applications, because new versions of the infrastructure are always deployed as brand-new, purpose-built deployments. The reusable code can also help reduce human error.

  • Numerous IaC tools are available, with some overlap and differences among them. The following are two types of IaC tools:

    • Orchestration tools: These tools provision server instances and leave configuration to other tools.

    • Configuration management tools: These tools install and manage deployments on existing server instances.

  • IaC tools use one of two programming logic types: declarative or imperative. With the declarative approach, the end state of the infrastructure is specified, the tool assembles it automatically, and always aims to maintain the desired state. This method is most commonly used by enterprises.

  • With the imperative approach, the tool helps prepare automation scripts that are then used to assemble infrastructure one step at a time. Although this approach requires more work, it has the advantage of requiring less expertise and can reuse existing automation scripts from previous deployments.
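
  • The toy sketch below (not a real IaC tool, just an illustration) contrasts the two styles: the declarative helper computes the actions needed to reconcile the actual state with a desired state, while the imperative list spells out each step explicitly.

```python
# Toy contrast between declarative and imperative IaC logic (illustrative only).
desired = {"web-1": "running", "web-2": "running"}  # declarative: what we want
actual = {"web-1": "running", "db-1": "running"}    # what currently exists


def reconcile(desired: dict, actual: dict) -> list:
    """Return the actions needed so that the actual state matches the desired state."""
    actions = []
    for name, state in desired.items():
        if actual.get(name) != state:
            actions.append(f"provision {name} -> {state}")
    for name in actual:
        if name not in desired:
            actions.append(f"decommission {name}")
    return actions


# Imperative: the operator lists every step in order.
imperative_steps = ["create vm web-2", "install runtime", "start service"]

print("declarative plan:", reconcile(desired, actual))
print("imperative steps:", imperative_steps)
```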

  • There are many IaC tools available, but the IT industry has converged around top tool vendors such as HashiCorp with Terraform and Red Hat with Ansible.

Cloud Computing

  • Cloud computing delivers computing services, including servers, storage, databases, networking, software, and analytics, over the internet. You typically pay only for the cloud services you currently use, which helps you lower your TCO, run your infrastructure more efficiently, and scale in and out as your daily business needs change.

  • The key benefits of cloud computing are as follows:

    • Cost: Cloud computing lets you offload some or all the expense and effort of purchasing, installing, configuring, and managing bare-metal servers and other on-premises infrastructure. You pay only for the cloud-based infrastructure as you use it.

    • Scale: You can scale capacity up and down in response to compute demands.

    • Performance: Additional compute resources can be provisioned in minutes, so there is no need for capacity planning compared to in-house resources.

    • Deployment speed: With most of the software already installed, you can quickly request infrastructure and compute based on pre-existing templates.

    • Reliability: The cloud provider uses multiple redundant sites.

    • Security: Cloud providers offer a broad set of policies and technologies to help you protect your data, apps, and infrastructure.

  • The following are the types of cloud computing:

    • Public cloud: In a public cloud, the cloud provider owns and manages all hardware, software, and other supporting infrastructure.

    • Private cloud: A private cloud is one in which the services and infrastructure are maintained on a private network, and cloud computing resources are used exclusively by a single organization or government.

    • Hybrid cloud: Public and private clouds work together using technology that allows data and applications to be shared. This approach gives your business greater flexibility, more deployment options, and helps optimize your existing infrastructure, security, and compliance.

  • The following are the types of cloud services:

    • Infrastructure as a service (IaaS): This service includes servers and VMs, storage, networks, and operating systems. Examples include Amazon Web Services (AWS), Microsoft Azure, and Google Compute Engine.

    • Platform as a service (PaaS): This service is an on-demand environment for developing, testing, delivering, and managing software applications. Examples include Microsoft Azure, SAP Cloud, Google App Engine, and Red Hat OpenShift.

    • Software as a service (SaaS): Cloud providers host and manage the software application and underlying infrastructure and handle any maintenance, such as software upgrades. Examples include Microsoft 365, Cisco Webex, and Google Workspace Apps.

    • Serverless computing: This service enables developers to build applications faster by eliminating the need for them to manage infrastructure. With serverless applications, the cloud service provider automatically provisions, scales, and manages the infrastructure that is required to run the code. The tasks associated with infrastructure provisioning and management are invisible to the developer. Examples include Google Cloud Functions, Microsoft Azure Functions, AWS Lambda, and serverless Kubernetes.

Total Cost of Ownership

  • Evaluating the TCO is important for any IT investment. The rise of AI has changed the way data centers operate. As AI continues to demand capacity, it significantly impacts the TCO of data centers. AI not only increases the demand for resources, it also presents opportunities to optimize efficiency and sustainability with different virtualization options, including CPU and GPU virtualization.

  • This topic primarily focuses on the expenses that are related to the computational infrastructure necessary to support AI workloads.

  • The following figure shows the TCO of the compute resources split into multiple categories.

Total Cost of Ownership Categories

  • Hardware costs: In this category, hardware costs refer to the purchase costs of servers equipped with state-of-the-art CPUs, GPUs, and possibly Tensor Processing Units. The Tensor Processing Unit is an AI accelerator ASIC developed by Google for neural network ML, using Google's own TensorFlow software.

    • In addition to compute, storage in the form of SSDs is also needed to handle large amounts of data.

    • One of the key enablers of compute performance is high-speed, ultra-low-latency networking equipment that ensures connectivity and fast data transfer between GPUs and data storage, ideally without involving CPUs (CPU offload capability). This can be achieved by using SmartNICs and specialized equipment, RDMA over InfiniBand, and RoCE Ethernet fabrics. This cost does not apply when the infrastructure is in the cloud.

  • Data Center Infrastructure Costs: When discussing data center infrastructure costs, the costs of renting or rebuilding space for a data center, cooling, adequate power supply, uninterruptible power supply (UPS), and safety equipment (such as IT-compatible fire alarm and extinguisher systems and early fire detection systems) need to be considered. This cost does not apply when the infrastructure is in the cloud.

  • Operational Costs: Electricity costs to power servers, storage, networking, and cooling need to be considered as operational costs. Regular maintenance and upgrades of hardware components also need to be factored in to ensure continuous operation.

    • In addition to hardware components, software maintenance, such as updates and patches for system software and firmware, needs to be considered.

  • Compliance Costs: Compliance costs typically increase as the regulation around an industry increases. With the development of AI-based technologies, regulatory and legal challenges are emerging. The European Union, the United States, and several other countries are all working on regulating this technology. Global companies that have operations all over the world face higher compliance costs than a company operating solely in one location due to various legal requirements. However, it is important to acknowledge that AI systems should also comply with current regulations and standards and meet ethical, security, and robustness requirements.

  • Downtime Costs: Although downtime could be included with operational costs, it is often so large that it deserves its own category. Downtime involves the costs of employees whose work is delayed and of those who address the issue, costs from lost production, and possibly lost customers due to an inability to meet time expectations.

  • Cloud versus On-Premises Costs: When considering costs, there is an important distinction between cloud and on-premises infrastructure.

    • For cloud computing, you must take the following costs into the account:

      • Cloud service fees, in case you are using a cloud service provider.

      • Scaling costs, which might be temporary based on the workload.

      • Data transfer costs, in case you go over the subscribed data usage threshold.

    • For on-premises computing, the following costs need to be considered:

      • Higher costs for setting up the infrastructure (hardware and data center infrastructure costs)

      • Regular upgrades of hardware components (servers, storage, networking, or individual components like GPUs) to maintain cutting-edge performance, if the requirements call for it.

      • Resource utilization: whether all hardware can be used efficiently, compared to cloud computing, where you pay only for the compute in use.

  • Training and Support Costs: A percentage of the costs must also be allocated to the staff who use and manage the new technologies, including their training. This staff includes system administrators, software developers, data science engineers, and data scientists.

  • Scalability Costs: Scalability refers to enhancing a system's capacity to manage an increased workload effectively. It is a long-term strategy that involves either upgrading the capabilities of existing resources (scaling up) or incorporating additional resources to share the load (scaling out). Scaling up might include boosting memory, processing power, or storage within the current setup.

    • In the short term, you need to consider the following:

      • Elasticity: Defined as "the degree to which a system is able to adapt to workload changes by provisioning and de-provisioning resources in an automatic manner so that the available resources match the current demand as closely as possible." An example is GPU partitioning when multiple workloads do not need all the compute that is currently available.

      • Over-provisioning: It refers to the practice of allocating more computing resources (such as CPU, memory, and storage) to a system or network than needed, incurring unnecessary costs.

      • Under-provisioning: It refers to the practice of allocating fewer computing resources to a system or network than needed, leading to reduced performance and possible downtime (out-of-memory exceptions, all storage consumed).

    • Based on compute requirements and time constraints (for example, to train an AI model), these options need to be factored in.

Example TCO Calculation

  • For an organization using cloud-based compute resources for AI, the calculation would typically include the following:

    • Data center infrastructure and hardware costs (initial setup): $0 (the organization is using cloud-based compute)

    • Operational costs

      • Cloud service fees: $25,000 per month for compute and storage, including temporary scaling costs

      • Additional data transfer fees over the allocated amount: $1,500 per month

    • Downtime costs: $2,000 per month

    • Compliance costs: $35,000

    • Staff costs:

      • Support (system administrators, software developers, data science engineers, and scientists): $150,000 per year

      • Training: $15,000 per year

  • To simplify the calculation, the TCO over three years is calculated without accounting for potential price changes and inflation, as follows:
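
  • The following sketch reconstructs the arithmetic from the figures listed above; cloud fees, data transfer, and downtime are monthly, staff and training costs are yearly, and the compliance figure is treated as a one-time cost over the period.

```python
# Simplified three-year TCO arithmetic based on the figures listed above.
months, years = 36, 3

monthly_costs = 25_000 + 1_500 + 2_000   # cloud fees + data transfer + downtime
yearly_costs = 150_000 + 15_000          # support staff + training
one_time_costs = 35_000                  # compliance (treated as a one-time cost)

tco = monthly_costs * months + yearly_costs * years + one_time_costs
print(f"Three-year TCO: ${tco:,}")       # -> Three-year TCO: $1,556,000
```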

  • The TCO for using cloud-based compute resources, including downtime, compliance, and staff costs, for AI over three years would be approximately $1,556,000. This simplified calculation is crucial for understanding and planning the budget that is required to sustain an organization’s AI infrastructure and make informed decisions about whether to opt for cloud-based or on-premises solutions.

  • Note: This sample calculation is simplified and does not reflect actual prices as they are on the market. It also does not include a risk of potential price changes and inflation, which are usually elements of such a calculation.

AI/ML Clustering

  • AI/ML architectures represent a generational shift in designing HPC environments. To satisfy the required network performance, AI and ML clusters require network infrastructure that is centered on ultra-low-latency and high-bandwidth links.

  • Parallel processing requirements are a game changer and serve as the driving force behind AI tasks. Because parallel processing capabilities are required for AI and ML, AI and ML clusters are transitioning from traditional CPU-based to GPU-based nodes.

  • This fundamental change in the architecture results in a change in data traffic flows. These flows, often referred to as "elephant flows," play a key role in carrying intensive workloads across the network and are what separate AI and ML clusters from conventional CPU-based clusters.

  • The backbone of this data traffic is parallel processing, a mechanism where thousands of GPU cores read, process, and store data based on the learning model characteristics deployed by the application. The synchronization of parallel workflows enables data processing efficiency and speed, which is a key factor in the performance of AI and ML applications. As a result, networks have evolved significantly to cater to the demands of parallel workflows.

  • The Cisco Nexus 9000 Series Switches have the hardware and software capabilities to provide the right latency, congestion management mechanisms, and telemetry to meet the requirements of AI/ML applications. With tools such as Cisco Nexus Dashboard Insights for visibility and Cisco Nexus Dashboard Fabric Controller for automation, Cisco Nexus 9000 Series Switches become ideal platforms for building a high-performance AI/ML network fabric.

Characteristics of AI/ML Clusters

  • An AI cluster could be described as one large compute entity with the following key features:

    • It includes many GPUs, storage, and other compute components (accelerators).

    • Its GPU-to-CPU ratio favors GPU (more GPUs than CPUs).

    • GPU, storage, and other compute components are directly coupled to the network and can communicate directly without involving the CPU (CPU offload).

    • Flash storage using NVMe or Flash arrays that are built on NVMe technology provides end-to-end NVMe from the server to the storage array using RoCE version 2 (RoCEv2).

    • A connectionless network can address the density, latency, and bandwidth requirements of unpredictable AI and ML data traffic models.

    • A network in the AI/ML cluster typically has the spine-leaf architecture to build a non-blocking network fabric.

AI/ML Cluster Types

  • Based on the compute and network requirements, two AI/ML cluster types are defined:

    • Distributed training: This type involves creating an ML model, training it by running the model on labeled data examples, then testing and validating the model. Workloads are shared across multiple worker nodes, which work in parallel to accelerate the training process (a minimal code sketch follows this list).

    • Production inference: This type puts an ML model to work on live data to produce an output. The inference system accepts inputs from end users, processes the data, feeds it into the ML model, and serves the result back to users.
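
  • The sketch below (an illustrative example with a placeholder model and random data, not part of the source material) shows the pattern behind distributed training: each worker node runs the same script, PyTorch DistributedDataParallel wraps the model, and gradients are synchronized across workers after every backward pass.

```python
# Minimal distributed-training sketch with PyTorch DDP (placeholder model and data).
# Launch with a tool such as torchrun, which sets the rank and world size.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    dist.init_process_group(backend="nccl")           # one process per GPU
    local_rank = dist.get_rank() % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 10).cuda()
    model = DDP(model)                                 # enables gradient all-reduce
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for _ in range(100):                               # toy training loop
        inputs = torch.randn(64, 1024).cuda()
        labels = torch.randint(0, 10, (64,)).cuda()
        loss = torch.nn.functional.cross_entropy(model(inputs), labels)
        optimizer.zero_grad()
        loss.backward()                                # gradients synchronized here
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```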

  • The bandwidth and infrastructure requirements differ based on the cluster type. They are summarized in the following table.

| | Distributed Training | Production Inference |
|---|---|---|
| Node-to-Node Bandwidth | High | Low |
| Key Metric | Training time of a model | High availability and latency |
| Operational Mode | Model training is offline | Usually online, requires real-time response |
| Infrastructure Requirement | Large network with many GPU/CPU hosts | Smaller network with mid-size CPU/GPU hosts |

Network Requirements for AI/ML Clusters

  • To satisfy density, latency, and bandwidth requirements, AI and ML clusters have GPUs that are directly connected to the spine-and-leaf architecture, bypassing traditional ToR switches. The spine-and-leaf architecture is a two-layer, full-mesh topology composed of a leaf layer and a spine layer. It was originally implemented in data centers to overcome the limitations of the three-tier architecture in environments where east-west traffic exceeds north-south traffic. East-west traffic flows are server-to-server communication within the same data center.

  • In the AI and ML clusters, GPU nodes typically consist of two to eight GPUs within a single network rack, with each GPU having a dedicated network connection to the leaf.

  • For high-performance use cases, where GPU-to-GPU traffic requires 400-Gbps Ethernet connectivity, building a separate network for these communications might be preferable. These types of networks are often referred to as a "back-end network." Back-end networks are generally designed for GPUDirect (GPU-to-GPU) or GPUDirect Storage (GPU-to-storage) communications. The requirements for back-end networks are the same as they are for any RoCEv2 network.

  • For example, in a high-performance 512-GPU cluster, you must connect 512 NICs with 400-Gbps speed. A Cisco Nexus 9364D-GX2A leaf switch will allow you to connect 32 GPUs or NICs to each leaf, leaving 32 ports for spine connectivity to build a nonblocking fabric. To connect all 512 NICs, 16 leaf switches are required.

  • To accommodate the amount of bandwidth coming from the leaf switches, 512 ports with 400-Gbps speed are needed for the spines. An appropriate spine switch is the Cisco Nexus 9364D-GX2A. Each leaf will connect to every spine using four 400-Gbps ports.

  • The network is represented in the following diagram.
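
  • The fabric sizing described above can be reproduced with a few lines of arithmetic (a sketch that assumes 64-port 400-Gbps switches, such as the Cisco Nexus 9364D-GX2A, in both the leaf and spine roles):

```python
# Nonblocking leaf-spine sizing for the 512-GPU example above
# (assumes 64-port 400-Gbps switches in both roles).
gpus = 512
ports_per_switch = 64

gpu_ports_per_leaf = ports_per_switch // 2        # 32 host-facing ports per leaf
uplinks_per_leaf = ports_per_switch // 2          # 32 spine-facing ports per leaf

leaves = gpus // gpu_ports_per_leaf               # 512 / 32 = 16 leaf switches
spine_ports_needed = leaves * uplinks_per_leaf    # 16 * 32 = 512 ports toward spines
spines = spine_ports_needed // ports_per_switch   # 512 / 64 = 8 spine switches
links_per_leaf_spine = uplinks_per_leaf // spines # 32 / 8 = 4 x 400-Gbps links

print(f"{leaves} leaves, {spines} spines, {links_per_leaf_spine} links per leaf-spine pair")
```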

AI/ML Cluster Resources Usage

  • As the demand for GPU resources continues to grow in AI, effective resource management becomes crucial to ensure optimal performance and efficient allocation of these resources. One of those critical resources is the GPU. Fractional GPUs optimize GPU utilization and allow users to right-size their GPU workloads.

  • When an AI workload is executed on a GPU, it uses the memory and compute subsystems of the GPU. A single workload running on a GPU uses the entire memory and compute subsystem capabilities, but when multiple workloads share a single GPU, they compete for access to the available resources.

  • Run:ai is an AI orchestration platform that manages AI workflows. When using shared GPU clusters, Run:ai's fractional GPU technology enables efficient allocation of GPU resources among multiple workloads sharing one or more GPUs, based on their requirements.

  • The versatility of Run:ai's GPU compute management solution enables the following use cases:

    • Model inference servers with different priorities: Some inference servers handle real-time requests, which require immediate responses, whereas others handle background tasks or offline requests. Run:ai's solution prioritizes high-priority tasks, ensures that critical tasks receive the GPU cluster resources they need, and avoids performance bottlenecks.

    • Model inference servers with different SLAs: Some inference servers require rapid response times, whereas others can tolerate longer response times.

    • Different users training models on a shared GPU cluster: In R&D environments, multiple users might share a GPU cluster for training AI models. Run:ai's solution ensures fair access to resources and consistent model training performance by allowing users to set priorities and resource allocations.

  • Therefore, Run:ai extends fractional GPU sharing capabilities by offering granular control over GPU compute resource allocation, empowering users to maximize the efficiency of their GPU clusters, and meeting the diverse needs of different workloads. This approach ensures optimal resource utilization and reliable workload performance.
