Microchip Technology Inc.
Firmware Engineering Technical Consultant
More and more data centers and other high-performance computing environments are using GPUs because of their ability to rapidly process the massive amounts of data generated in deep learning and machine learning applications. However, like many new data center innovations that improve application performance, this innovation exposes new system bottlenecks. In these applications, emerging architectures for improving system performance share system resources among multiple hosts through a PCIe® fabric.
The PCIe standard (especially its traditional tree-based hierarchy) limits how (and how much) resource sharing can be achieved. However, it is possible to implement a low-latency, high-speed fabric approach that allows a large number of GPUs and NVMe SSDs to be shared among multiple hosts, while still supporting standard system drivers.
The PCIe fabric approach uses dynamic partitioning and multi-host single-root I/O virtualization (SR-IOV) sharing. Peer-to-peer transfers can be routed directly within the PCIe fabric. This provides optimal routing for peer-to-peer transfers, reduces root port congestion, and balances the load on CPU resources more efficiently.
Traditionally, GPU transfers had to be staged through the CPU's system memory, which led to memory-sharing contention between endpoints. When the GPU uses its shared memory-mapped resources instead of CPU memory, it can fetch data locally without passing it through the CPU first. This eliminates extra hops and link traversals, and the latency they add, allowing the GPU to process data more efficiently.
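The saving can be made concrete by counting link traversals. The sketch below (plain Python with hypothetical device names, not measurements from the article) compares a transfer staged through the root complex and CPU memory against a direct peer-to-peer path through the switch:

```python
def path_hops(path):
    """Number of link traversals along a device-to-device path."""
    return len(path) - 1

# Traditional route: data is staged in CPU memory, so it crosses the
# switch twice and traverses the root complex.
via_cpu = path_hops(["GPU1", "switch", "root complex / CPU memory",
                     "switch", "GPU2"])

# Fabric peer-to-peer route: data moves through the switch directly.
direct = path_hops(["GPU1", "switch", "GPU2"])

print(via_cpu, direct)  # 4 hops vs. 2 hops
```

Fewer hops means fewer store-and-forward stages and less contention on the root port, which is where the latency reduction comes from.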
Inherent limitations of PCIe
A PCIe hierarchy is a tree structure: each domain has a single root complex, from which links fan out through switches and bridges to endpoint "leaves". The strict hierarchy and directionality of links impose costly design requirements on multi-host, multi-switch systems.
Figure 1 – Multi-host topology
Take the system shown in Figure 1 as an example. To comply with the PCIe hierarchy, host 1 must have a dedicated downstream port in switch fabric 1 that connects to a dedicated upstream port in switch fabric 2. It also requires a dedicated downstream port in switch fabric 2 that connects to a dedicated upstream port in switch fabric 3, and so on. Host 2 and Host 3 have similar requirements, as shown in Figure 2.
Figure 2 – Tier requirements for each host
Even the most basic system based on the PCIe tree structure requires three links between adjacent switch fabrics, one dedicated to each host's PCIe topology. Because these links cannot be shared between hosts, the system quickly becomes extremely inefficient.
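A back-of-the-envelope count makes the cost explicit. Assuming a daisy chain of switches as in Figure 1, each host must own a private link at every switch-to-switch boundary, so the dedicated-link count grows multiplicatively (a toy calculation under that assumption, not a formula from the PCIe specification):

```python
def dedicated_interswitch_links(num_hosts: int, num_switches: int) -> int:
    """Links required when every host must own a private
    downstream-to-upstream link at each switch boundary."""
    boundaries = num_switches - 1
    return num_hosts * boundaries

# The Figure 1/2 system: 3 hosts across 3 daisy-chained switches.
print(dedicated_interswitch_links(3, 3))  # 6 links, 3 per boundary
```

Scaling to even a modestly larger system, say 4 hosts and 5 switches, already consumes 16 inter-switch links that carry only one host's traffic each.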
Additionally, a typical PCIe-compliant hierarchy has only one root port. Although multiple roots are supported by the Multi-Root I/O Virtualization and Sharing (MR-IOV) specification, MR-IOV complicates the design and is not currently supported by mainstream CPUs. The result is that unused PCIe devices (i.e., endpoints) are stuck in the host to which they are assigned. It is not hard to imagine how inefficient this becomes in larger systems with multiple GPUs, storage devices and their controllers, and switch fabrics.
For example, if the first host (host 1) has consumed all of its computing resources while hosts 2 and 3 are underutilizing theirs, host 1 would ideally be given access to those idle resources. But host 1 cannot reach them because they lie outside its hierarchical domain, so they sit stranded. Non-Transparent Bridging (NTB) is a potential workaround, but it complicates the system because each type of shared PCIe device requires non-standard drivers and software. A better approach is a PCIe fabric, which allows a standard PCIe topology to accommodate multiple hosts, each with access to every endpoint.
Method of implementation
The system is implemented with a PCIe fabric switch (in this case, a member of the Microchip Switchtec® PAX family) operating in two separate but transparently interoperable domains: a fabric domain containing all endpoints and fabric links, and a dedicated host domain for each host (Figure 3). The hosts are kept in separate virtual domains by the PAX switch firmware running on the embedded CPU, so each host always sees a standard single-layer PCIe switch with directly attached endpoints, regardless of where those endpoints actually sit in the fabric.
Figure 3 – Independent domains for each structure
Transactions crossing between a host domain and the fabric domain have their IDs and addresses translated in both directions, enabling non-hierarchical routing of traffic within the fabric domain. In this way, all hosts in the system can share the fabric links that connect the switches and the endpoints. The switch firmware intercepts all configuration-plane traffic from the host (including the PCIe enumeration process) and virtualizes a simple, PCIe-compliant switch with a configurable number of downstream ports.
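As a rough illustration of the translation step, the sketch below models a per-host lookup table that rewrites a host-domain memory address into its fabric-domain equivalent. All names, addresses, and the structure are hypothetical; real TLP translation in the switch firmware is far more involved:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Window:
    host_base: int    # BAR address as enumerated by the host
    fabric_base: int  # where the endpoint's BAR really lives in the fabric
    size: int

class HostTranslationTable:
    """One table per host domain, consulted on every transaction."""
    def __init__(self):
        self.windows = []

    def add(self, w: Window):
        self.windows.append(w)

    def host_to_fabric(self, addr: int) -> int:
        """Rewrite a host-domain address into the fabric domain."""
        for w in self.windows:
            if w.host_base <= addr < w.host_base + w.size:
                return w.fabric_base + (addr - w.host_base)
        raise LookupError(f"address {addr:#x} not mapped for this host")

table = HostTranslationTable()
table.add(Window(host_base=0x9000_0000,
                 fabric_base=0x4_2000_0000,
                 size=0x100_0000))
print(hex(table.host_to_fabric(0x9000_1000)))  # 0x420001000
```

Because each host has its own table, two hosts can use overlapping host-domain addresses while targeting entirely different endpoints in the shared fabric.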
While all control-plane traffic is routed to the switch firmware for processing, data-plane traffic is routed directly to the endpoints. GPUs sitting unused in other host domains are no longer stranded, as they can be dynamically allocated based on the needs of each host. Peer-to-peer traffic is supported within the fabric, which makes it well suited to machine learning applications. Because functionality is presented to each host in a PCIe-compliant manner, standard drivers can be used.
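Dynamic allocation can be pictured as a bind/unbind interface over a shared endpoint pool. The toy model below uses hypothetical names (it is not the PAX firmware API) to capture the behavior described above:

```python
class FabricPool:
    """Endpoints live in a shared fabric pool and are bound to, or
    released from, individual host domains at runtime."""
    def __init__(self, endpoints):
        self.free = set(endpoints)
        self.bound = {}  # endpoint -> host

    def bind(self, endpoint, host):
        if endpoint not in self.free:
            raise ValueError(f"{endpoint} is not available")
        self.free.remove(endpoint)
        self.bound[endpoint] = host

    def unbind(self, endpoint):
        self.bound.pop(endpoint)
        self.free.add(endpoint)

    def host_view(self, host):
        # Each host sees only its own endpoints, presented as if they
        # hung off a simple single-layer virtual switch.
        return sorted(e for e, h in self.bound.items() if h == host)

pool = FabricPool(["gpu0", "gpu1", "gpu2", "gpu3"])
for gpu in ["gpu0", "gpu1", "gpu2", "gpu3"]:
    pool.bind(gpu, "host1")
pool.unbind("gpu2")          # release back into the fabric pool...
pool.bind("gpu2", "host2")   # ...and hand it to another host
print(pool.host_view("host1"), pool.host_view("host2"))
```

The key point the model illustrates is that rebinding changes only which host domain an endpoint is presented to; the endpoint never physically moves, and each host continues to see a plain PCIe switch.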
How it works
To understand how this method works, take the system in Figure 4 as an example: two hosts (host 1 running Windows®, host 2 running Linux®), four PAX PCIe fabric switches, four Nvidia M40 GPGPUs, and a Samsung NVMe SSD that supports SR-IOV. In this experiment, the hosts run traffic representative of real machine learning workloads, including Nvidia's CUDA peer-to-peer transfer benchmark utility and a TensorFlow model training cifar10 image classification. The embedded switch firmware handles low-level configuration and management of the switches, and the system is managed by Microchip's ChipLink debug and diagnostic utility.
Figure 4: Dual Host PCIe Fabric Engine
Four GPUs are initially assigned to host 1, and the PAX fabric manager shows all devices discovered in the fabric, with the GPUs bound to the Windows host. The topology presented to the host, however, is much simpler: all GPUs appear to be directly connected to a virtual switch. Once the fabric manager binds the devices, Windows Device Manager shows the GPUs. The host sees the switch as a simple physical PCIe switch with a configurable number of downstream ports.
Once CUDA discovers the four GPUs, the peer-to-peer bandwidth test shows a unidirectional transfer rate of 12.8 GBps and a bidirectional rate of 24.9 GBps. These transfers cross the PCIe fabric directly without passing through the host. After running the TensorFlow model that trains the cifar10 image classification algorithm with the workload distributed across all four GPUs, two of the GPUs can be unbound from the host and released back into the fabric pool, freeing them for other workloads. Like the Windows host, the Linux host sees the switch as a simple PCIe switch and needs no custom drivers, and CUDA can likewise discover the GPUs and run peer-to-peer transfers there. Performance is similar to that achieved with the Windows host, as shown in Table 1.
Table 1: GPU peer-to-peer transfer bandwidth
Host 1 average bandwidth
Host 2 average bandwidth
The next step is to attach SR-IOV virtual functions to the Windows host. PAX presents these virtual functions as standard physical NVMe devices, so the host can use standard NVMe drivers. A virtual function is then bound to the Linux host as well, and the new NVMe device appears in that host's device list. The result of this experiment is that both hosts can now use their virtual functions independently.
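The SR-IOV sharing step can be modelled the same way: one physical function exposes several virtual functions, and each VF is handed to a different host, which drives it with a standard NVMe driver. A minimal sketch with hypothetical names (nothing here reflects the PAX or NVMe interfaces):

```python
class SriovNvmeDrive:
    """One physical function (PF) exposing num_vfs virtual functions,
    each of which can be assigned to a different host."""
    def __init__(self, name: str, num_vfs: int):
        self.name = name
        self.vf_owner = {vf: None for vf in range(num_vfs)}

    def assign_vf(self, vf: int, host: str):
        if self.vf_owner[vf] is not None:
            raise ValueError(f"VF {vf} already owned by {self.vf_owner[vf]}")
        self.vf_owner[vf] = host

    def hosts(self):
        """Hosts currently sharing this drive, each via its own VF."""
        return {h for h in self.vf_owner.values() if h is not None}

ssd = SriovNvmeDrive("nvme-ssd", num_vfs=4)
ssd.assign_vf(0, "windows-host")  # VF 0 -> host 1
ssd.assign_vf(1, "linux-host")    # VF 1 -> host 2
print(sorted(ssd.hosts()))  # ['linux-host', 'windows-host']
```

Because each host owns a distinct VF, the two hosts issue I/O independently while the single physical SSD arbitrates between them, which is exactly the independence the experiment demonstrates.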
It is important to note that the virtual PCIe fabric and all dynamic allocation operations are presented to the host in a fully PCIe-compliant manner so that the host can use standard drivers. The embedded switch fabric firmware provides a simple management interface so that the PCIe fabric can be configured and managed by an inexpensive external processor. Device peer-to-peer transactions are enabled by default and require no additional configuration or management by an external fabric manager.
The PCIe switch fabric is an excellent way to take full advantage of enormous compute resources such as GPUs, but the PCIe standard itself presents some hurdles. These challenges can be addressed with dynamic partitioning and multi-host single-root I/O virtualization sharing, so that GPU and NVMe resources can be dynamically allocated in real time to any host in a multi-host system, keeping pace with the changing demands of machine learning workloads.