AI-Enabling Hardware
CPUs, GPUs, and DPUs
Application architecture is changing: from monolithic applications deployed on a single bare-metal server, to multitiered applications deployed across a handful of virtual machines (VMs), all the way to microservice-based applications running as hundreds or even thousands of containers on hundreds of nodes, possibly across multiple data centers. Today's applications are highly distributed, and their components all talk to each other.
Hence, data center infrastructure needs to evolve as well. Because of the slowdown in general CPU performance growth, the industry is witnessing the rise of domain-specific hardware accelerators. CPUs are being supplemented with GPUs and DPUs that specialize in functions like AI workload processing and I/O operations at throughput levels that modern data centers need.
CPUs excel at general-purpose tasks but can be much less efficient than a purpose-built processor for specific tasks. Offloading data-intensive workloads (I/O, network, and security functions) from the CPU to the GPU and DPU improves performance and frees the CPU for the general-purpose tasks where it excels.
Explore the CPU, GPU, and DPU:
CPU: A CPU consists of a few cores that are optimized for sequential processing. CPUs are made to be flexible and can perform a wide range of complex tasks. The following are CPU characteristics:
General-purpose processor
Executes a wide range of tasks
Optimized for single-threaded performance
High clock speeds
GPU: A GPU has thousands of smaller cores designed to handle many calculations simultaneously. The following are GPU characteristics:
Specialized for parallel processing
Contains thousands of smaller, efficient cores designed for handling multiple tasks simultaneously
Optimized for tasks requiring high throughput and parallelism
Lower clock speeds for each core compared with CPUs

DPU: A DPU is a domain-specific accelerator that usually contains a small-scale general-purpose CPU, high-bandwidth memory, and dedicated accelerators.
Unlike CPUs and GPUs, DPUs were designed specifically for data processing. The accelerators are tightly paired with the onboard general-purpose CPU and memory over a dedicated internal interconnect, which enables a high-throughput computing system. Because DPUs are relatively new, their architectures are still evolving, and there is no single definition of a DPU's functions and structure. Examples of these architectures include NVIDIA BlueField, AMD Pensando, and Intel IPU.
The following are DPU characteristics:
Specialized for data-centric tasks
Combines programmable processors with hardware acceleration for specific networking, storage, and security functions
Offloads data-intensive tasks from the CPU and GPU
Optimized for high throughput and low latency

These three processing unit types all help support complex computing, but each is suited to different tasks, workloads, and use cases. The following table compares each unit type from the performance, efficiency, flexibility, and cost points of view.
Performance
CPU: Best for general-purpose tasks with a focus on single-threaded performance
GPU: Best for tasks requiring massive parallelism, such as graphics rendering and ML
DPU: Best for offloading data-centric and infrastructure tasks, optimizing network, storage, and security functions
Efficiency
CPU: Efficient for a broad range of tasks, but less so for highly parallel workloads
GPU: Highly efficient for parallel tasks, but can consume more power
DPU: Highly efficient for data movement and processing tasks, reducing CPU and GPU load
Flexibility
CPU: Most flexible, able to handle a wide range of tasks
GPU: Less flexible, specialized for parallel processing
DPU: Specialized for specific data-centric tasks, offering less flexibility but high efficiency in those areas
Cost
CPU: Generally lower cost for general-purpose computing
GPU: Higher cost, especially for high-performance models used in professional and scientific applications
DPU: Cost can vary; often used in data centers where efficiency gains justify the investment
GPU Overview
The most important AI hardware components are processors, such as GPUs. GPUs were initially designed to accelerate video-game graphics by rendering high-quality images in real time.
Although GPU and graphics (or video) card are often used interchangeably, the terms are distinct: the GPU is the purpose-built chip itself, while a graphics card is the separate circuit board on which that chip is mounted (also called a graphics add-in board).
There are two basic types of GPUs:
Integrated: This GPU is embedded in the CPU.
Discrete: This GPU is a graphics card that is typically attached to a Peripheral Component Interconnect Express (PCIe) slot.
Many applications can run well with integrated GPUs. A discrete GPU has a significant performance advantage for more resource-intensive applications with extensive performance demands, such as those with many parallel tasks. Discrete GPUs add processing power but consume additional energy and generate heat, which usually requires dedicated cooling for maximum performance.
GPUs are programmable and allow a wide range of applications:
GPUs for gaming: These GPUs render graphics in both 2-D and 3-D. With better graphics performance, games can be played at a higher resolution, faster frame rates, or both.
GPUs for video editing: These GPUs have parallel processing, built-in AI capabilities, and advanced acceleration for faster and easier video and graphics rendering.
GPUs for the data center: These GPUs offer better support for virtualization, parallel operations, AI, media, media analytics, and 3-D rendering solutions.
The architecture of a GPU is ideal for AI tasks, especially deep learning, because deep learning workloads are usually embarrassingly parallel. In parallel computing, an embarrassingly parallel workload or problem is one that can easily be split into many parallel tasks. GPUs have many smaller cores. These cores are specialized for different mathematical operations, such as floating-point operations, and are ideal for handling parallel tasks. Traditional CPU cores are individually more powerful but are optimized for serial operations, and CPUs do not have many highly specialized computational cores. This architectural advantage means GPUs can process enormous amounts of data in parallel. Instead of waiting weeks for a model to train on a CPU, the same task can be completed in days or even hours on a GPU.
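To make "embarrassingly parallel" concrete, the following minimal CUDA C++ sketch (the kernel name and launch configuration are illustrative, not taken from any vendor code) applies an elementwise ReLU to a large array; every GPU thread computes exactly one output element:

    #include <cuda_runtime.h>

    // Each thread computes one output element independently of all
    // others, which is what makes this workload embarrassingly parallel.
    __global__ void relu(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            out[i] = in[i] > 0.0f ? in[i] : 0.0f;
        }
    }

    // Launch enough 256-thread blocks to cover all n elements, e.g.:
    // relu<<<(n + 255) / 256, 256>>>(d_in, d_out, n);

Because no thread depends on another thread's result, the hardware is free to run tens of thousands of such threads at once, which is exactly the workload shape that deep learning produces.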
GPU Architecture
The GPU is a highly parallel processor architecture that is composed of processing elements and a memory hierarchy. At a high level, GPUs consist of many streaming multiprocessors (SMs), an on-chip level 2 (L2) cache, and high-bandwidth memory (HBM). Streaming multiprocessors execute arithmetic and other instructions; data and code are accessed from DRAM through the L2 cache.

NVIDIA GPUs can have up to three distinct GPU core types, depending on the GPU model:
Ray-tracing cores: These cores enable computers to render photorealistic graphics with physically accurate lighting, including reflections, shadows, ambient occlusion, global illumination, and more.
Tensor cores: These cores specifically target mathematical matrix operations. They are ideal for accelerating deep learning and AI workloads.
CUDA cores: The Compute Unified Device Architecture (CUDA) core is the primary processor core type responsible for complex parallel mathematical calculations.
CUDA is a parallel computing platform and programming model developed by NVIDIA for general computing on GPUs. With CUDA, developers can speed up computing applications by using the power of GPUs.
In GPU-accelerated applications, the sequential part of the workload runs on the CPU, which is optimized for single-threaded performance, while the compute-intensive portion runs on thousands of GPU cores in parallel. When using CUDA, developers program in popular languages such as C++, Fortran, and Python, and express parallelism through language extensions, which are software add-ons that expose GPU capabilities from regular programming-language code.
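The sketch below illustrates this division of labor (a minimal CUDA C++ example, assuming the CUDA toolkit is installed; the kernel and variable names are illustrative): the host code runs sequentially on the CPU, while the kernel launch fans the computation out across thousands of GPU threads:

    #include <cuda_runtime.h>
    #include <cstdio>

    // GPU kernel: each of the many parallel threads handles one element.
    __global__ void saxpy(int n, float a, const float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    int main() {
        const int n = 1 << 20;                 // one million elements
        size_t bytes = n * sizeof(float);

        // Sequential part: the CPU allocates and initializes the data.
        float *h_x = (float *)malloc(bytes);
        float *h_y = (float *)malloc(bytes);
        for (int i = 0; i < n; i++) { h_x[i] = 1.0f; h_y[i] = 2.0f; }

        // Copy the inputs to GPU memory.
        float *d_x, *d_y;
        cudaMalloc(&d_x, bytes);
        cudaMalloc(&d_y, bytes);
        cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_y, h_y, bytes, cudaMemcpyHostToDevice);

        // Compute-intensive part: runs across thousands of GPU threads.
        saxpy<<<(n + 255) / 256, 256>>>(n, 3.0f, d_x, d_y);

        // Copy the result back and check one element on the CPU.
        cudaMemcpy(h_y, d_y, bytes, cudaMemcpyDeviceToHost);
        printf("y[0] = %f\n", h_y[0]);         // expect 5.0

        cudaFree(d_x); cudaFree(d_y);
        free(h_x); free(h_y);
        return 0;
    }

A program like this would typically be built with NVIDIA's nvcc compiler (for example, nvcc saxpy.cu -o saxpy).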
NVIDIA GPUs for AI/ML
NVIDIA has produced many generations of GPUs, primarily graphics cards such as the GeForce family, aimed at the workstation and desktop computer markets. With the evolution of AI and machine learning (ML), each hardware generation has gained more processing power and new features, with an increasing focus on high-performance processing capabilities for data centers and AI.
The NVIDIA H100 Tensor Core GPU enables large-scale AI and high-performance computing (HPC). The H100 is designed to provide the highest and most cost-effective performance for deep learning inference and training workloads. It includes the NVIDIA AI Enterprise software suite to streamline AI development and deployment.
NVIDIA NVLink is a high-speed GPU interconnect that offers a faster alternative for multi-GPU systems than traditional PCIe-based solutions. Fourth-generation NVLink provides 900 GB/s of total bandwidth per GPU, roughly seven times the bandwidth of PCIe Gen5. Connecting NVIDIA GPUs with NVLink enables memory and performance to be scaled to meet the demands of large visual computing workloads. With fourth-generation NVLink, the H100 accelerates workloads with a dedicated Transformer Engine for trillion-parameter language models. For smaller jobs, the H100 can be partitioned into Multi-Instance GPU (MIG) partitions.
The MIG feature allows GPUs to be securely partitioned into up to seven separate GPU instances, providing multiple users with separate GPU resources for optimal GPU utilization. This feature is useful for workloads that do not fully saturate the GPU's compute capacity, and therefore, users may want to run different workloads in parallel to maximize utilization. For cloud service providers with multitenant use cases, MIG ensures that one client cannot impact the work or scheduling of other clients and provides enhanced isolation for customers.
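To software, each MIG instance looks like an ordinary GPU. The sketch below (an illustrative CUDA C++ example; it assumes a MIG-enabled GPU whose instance has been exposed to the process, for example through the CUDA_VISIBLE_DEVICES environment variable) simply enumerates the visible devices and prints the resources each one presents:

    #include <cuda_runtime.h>
    #include <cstdio>

    int main() {
        // With MIG, a process typically sees the single GPU instance that
        // was made visible to it; each instance appears as a normal CUDA
        // device with its own dedicated share of SMs and memory.
        int count = 0;
        cudaGetDeviceCount(&count);
        for (int i = 0; i < count; i++) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, i);
            printf("device %d: %s, %d SMs, %.1f GB memory\n",
                   i, prop.name, prop.multiProcessorCount,
                   prop.totalGlobalMem / 1e9);
        }
        return 0;
    }

Because each instance reports only its own SM count and memory, applications need no changes to run inside a MIG partition.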
The NVIDIA A100 Tensor Core GPU is designed to accelerate compute-intensive workloads such as AI, deep learning, and data science. The NVIDIA A100 is available in PCIe and SXM (Server PCI Express Module) form factors. The SXM form factor is available with servers that support NVIDIA NVLink and offers higher performance per card than the PCIe equivalent. With these servers, the NVIDIA A100 provides high performance and scaling for hyperscale and HPC data centers that run applications that scale to multiple GPUs, such as deep learning applications.
The NVIDIA A30 Tensor Core GPU is designed to provide cost-effective performance for deep learning inference workloads. For these workloads, the most important consideration is typically the combination of performance per dollar and the flexibility provided by support for the MIG feature, which allows the GPU to be securely partitioned into up to four separate GPU instances.
The following table summarizes these GPUs:
H100
Performance focus: AI training and inference, HPC, data analytics
MIG instances: Up to 7
A100
Performance focus: AI training and inference, HPC, data analytics
MIG instances: Up to 7
A30
Performance focus: AI inference
MIG instances: Up to 4
Intel GPUs for AI/ML
The Intel Data Center GPU Flex Series is a flexible, robust, and open GPU solution for the intelligent visual cloud. The GPU supports various workloads, such as media streaming, cloud gaming, AI visual inference, and VDI workloads. It supports an open, standards-based software stack optimized for density and quality with high reliability, availability, and scalability capabilities.
The Intel Data Center GPU Flex 140 is a 75-watt, low-profile PCIe Gen4 GPU card for accelerating cloud gaming, media processing and delivery, VDI, and AI visual inference applications in data center servers. Each PCIe card carries two GPUs, each with eight Xe cores, two media engines, and 6 GB of attached Graphics Double Data Rate 6 (GDDR6) memory. Each Flex Series 140 GPU can support up to 12 VDI sessions, making the card a compelling solution for high-density VDI deployments.
The Flex Series 170 GPU targets workloads that demand higher graphics performance. It is a 150-watt, full-size PCIe Gen4 card with a single GPU that has 32 Xe cores, two media engines, and 16 GB of GDDR6 memory. The Flex Series 170 GPU can support up to 16 virtual GPUs (vGPUs).

Intel also enables GPU virtualization. GPU capabilities are exposed through hardware-assisted virtualization, such as Single Root I/O Virtualization (SR-IOV), and hard-partitioned graphics. GPU virtualization lets you share the GPU among multiple guest VMs, provides near-native graphics performance in each VM, and still allows the host to use the virtualized GPU. Unlike NVIDIA, Intel does not require a license for virtual GPU capability.

DPU Overview
The DPU is a new type of programmable processor. DPUs play a crucial role in moving data between compute nodes and storage as fast as possible.
A DPU is a system-on-chip (SoC) architecture that combines the following components:
Software-programmable multi-core CPU, typically based on the Advanced RISC Machine (ARM) architecture
High-performance network interface
Flexible and programmable acceleration engines
A DPU helps the CPU by taking over its networking and communication workloads. It uses hardware acceleration technology and a high-performance network interface to offload data transfers, data compression, data storage, and data security, tasks that are usually assigned to the CPU.
Although the DPU can be used as a standalone embedded processor, it is usually incorporated into a SmartNIC, a network interface controller that is a critical component for fast, low-latency communication between data center servers.
The network interface must be powerful and flexible enough to handle all network data path processing. The DPU's embedded CPU should be used for control path initialization and exception processing.
The following figure describes the minimum capabilities that the flexible and programmable acceleration engines, components of a DPU, should provide.

NVIDIA BlueField
The NVIDIA BlueField-3 networking platform is designed to accelerate data center infrastructure workloads. Supporting Ethernet and InfiniBand connectivity, BlueField-3 offers speeds up to 400 Gbps. It combines powerful computing with software-defined hardware accelerators for networking, storage, and cybersecurity. The platform is fully programmable through the NVIDIA DOCA software framework.
The NVIDIA BlueField-3 DPUs offload, accelerate, and isolate software-defined networking, storage, security, and management functions.
The following figure lists the BlueField-3 DPU hardware accelerators.

A typical configuration of a BlueField-3 DPU would include the following:
P-series processor with 16 ARM cores
Support for 400-Gbps Ethernet or InfiniBand
Two quad small form-factor pluggable (QSFP) ports
PCIe Gen5 interface with 16 lanes (x16)
32 GB DDR5 memory
SmartNIC Overview
A NIC is a PCIe card that plugs into a server or a storage box to enable connectivity to an Ethernet network. A DPU-based SmartNIC performs network traffic processing on the NIC itself, processing that the CPU would otherwise have to perform with a foundational NIC.
SmartNICs are used by telecommunications providers, media companies, high-frequency trading applications (which require low latency), and AI and ML workloads. They run in storage servers, database clusters, and data warehouses.
The key features of SmartNICs are as follows:
Accelerates software-defined networking stacks.
Provides security with inline encryption and decryption.
Enables firewalls and a hardware root-of-trust for a secure boot.
Runs security protocols such as TLS and IPsec.
Handles storage and data-access protocols such as RoCEv2, GPUDirect Storage, and Non-Volatile Memory Express (NVMe) over TCP.
Supports virtualized data centers with virtual switching and routing.
Provides from 25 to 400 Gbps of data throughput.

A SmartNIC uses highly specialized hardware units called accelerators that run communications tasks more efficiently than CPUs. It also uses programmable cores that users can program to handle custom tasks in a data center. This combination of accelerators and programmable cores provides performance, flexibility, and ultra-low latency (ULL).
A SmartNIC implements operations that save CPU cycles, including data-plane and control-plane functions (networking, storage, and security), which the SmartNIC can fully implement and offload.
The three types of circuit-based SmartNICs have the following characteristics:
Field-Programmable Gate Array (FPGA)-based
Programmable
Design optimized for a specific application domain
Expensive
ASIC-based
High cost of development
Predefined capabilities
Configurable
Highest performance
SoC-based (also called a DPU)
Good price performance
SoC with additional hardware accelerators
C programmable processors
Highest flexibility
Easiest programmability
FPGAs are integrated circuits. They are called "field-programmable" because they allow customers to reconfigure the hardware to meet specific use case requirements.
ASICs typically offer the best performance and some flexibility, but they are limited to the predefined capabilities of the integrated circuits.
A system-on-chip design, or DPU, blends dedicated hardware accelerators with programmable processors.

Cisco Nexus SmartNIC Family
Cisco Nexus SmartNICs are next-generation, FPGA-based, high-resolution time-stamping adapters optimized for networks that require ULL. The ULL technology incorporated in Cisco Nexus SmartNICs supports high-performance computing, high-frequency financial trading, and AI and ML workloads.
The key features of Cisco Nexus SmartNICs are as follows:
ULL: Cisco Nexus SmartNICs are designed and optimized for ultra-low latency operation.
Software acceleration: Accelerate your network applications with Cisco Nexus SmartNIC inline sockets acceleration and kernel bypass technology.
Time synchronization: Cisco Nexus SmartNICs provide high-precision hardware time stamping and time synchronization.
FPGA programmability: Field-upgradable SmartNICs allow customizable functionality.
40 Gigabit Ethernet support: Uses small-form-factor pluggable modules that are designed for speeds up to 40 Gbps.
The Cisco Nexus K3P-S FPGA SmartNIC is specifically optimized for low-latency operation. It features software trigger-to-response latencies as low as 568 ns.
The Cisco Nexus K3P-S FPGA SmartNIC provides a powerful programmable software interface.
Its programmability features include the following:
Zero-latency cost hardware flow steering: This feature allows users to steer and prefilter important traffic to the correct memory and CPU core without additional latency.
Cut-through switching: This feature allows software to forward a packet before the whole packet has been received, usually when the destination address and outgoing interface are determined. Compared to store-and-forward, this technique reduces latency.
ExaSOCK TCP/IP acceleration: This feature provides a transparent TCP/IP acceleration system for socket applications using the ExaSOCK acceleration library.
Preloaded packet transmission: The Cisco Nexus K3P-S FPGA SmartNIC allows users to preload transmit frames, which saves 60 ns.
High-resolution timestamps: A timestamp with 4 ns resolution is applied to every received packet and to the most recently transmitted packet.
FPGA network adapters support firmware upgrades, which allow new features and speed enhancements to be added after deployment.
NVIDIA BlueField SuperNIC
One of the groundbreaking innovations making generative AI possible is a networking solution developed by NVIDIA called the super network interface card (SuperNIC). SuperNICs are uniquely optimized to accelerate AI networks and provide robust and seamless connectivity between GPU servers.
Distributed AI training and inference communication flows depend heavily on network bandwidth availability. SuperNICs scale more effectively than DPUs to deliver an impressive 400-Gbps network bandwidth per GPU. The purpose of the SuperNIC is to accelerate networking for AI cloud computing. It achieves this goal using less computing power than a DPU, which requires substantial computational resources to offload applications from a host CPU.
The NVIDIA BlueField-3 SuperNIC is a class of network accelerators that is based on the BlueField-3 networking platform, which is purpose-built for supercharging hyperscale AI workloads. Designed for network-intensive parallel computing, the BlueField-3 SuperNIC provides up to 400 Gbps of RoCE network connectivity between GPU servers to optimize peak AI workload efficiency.
NVIDIA BlueField-3 SuperNICs have the following attributes:
High-speed packet reordering support is available when combined with an NVIDIA network switch. This combination ensures that data packets are received and processed in the same order in which they were originally transmitted, which maintains the sequential integrity of the data flow.
Advanced congestion control using real-time telemetry data and network-aware algorithms manage and prevent congestion in AI networks.
Programmable compute on the I/O path enables network infrastructure customization and flexibility in AI cloud data centers.
The power-efficient, low-profile design efficiently accommodates AI workloads within constrained power budgets.
Full-stack AI optimization includes compute, networking, storage, system software, communication library, and application frameworks.