In our news item on the Pliops coprocessor for SSDs, which is designed to offload storage-related work from the CPU, we mentioned that other companies are developing similar solutions. The concept of a “data processing unit” (DPU) is already fairly mature, but there is still no consensus on architecture or software. What does the current DPU market have to offer?
Mellanox experimented actively with data processing on the network adapter, the most peripheral device in any server system, and it continues that work now that it is part of NVIDIA. The main NVIDIA/Mellanox product in this area is the BlueField-2 chip. On the one hand, it provides the functionality of a typical SmartNIC (one 200G Ethernet port or two 100G-class ports); on the other, it supports NVMe over Fabrics and offloads the CPU in everything related to I/O tasks.
The chip contains both an array of ARM cores and specialized ASIC blocks for accelerating various functions, complemented by 16 GB of onboard DDR4 RAM. NVIDIA sees DPUs like the BlueField-2 as part of a “CPU + GPU + DPU” bundle. Because it relies on the ARM architecture, this approach is general-purpose, and that view is shared, for example, by analysts at Wells Fargo.
But other players on the market are also actively implementing the ideas behind the DPU concept. One of them is Amazon Web Services, among the largest cloud service providers. It has developed its own DPU accelerator, the Nitro card. In general terms, this solution resembles the NVIDIA/Mellanox BlueField-2, but it uses a different, AWS-proprietary ASIC.
Elastic Compute Cloud instances run on these PCI Express accelerators. AWS does not limit itself to a single offering but provides various customized versions for compute, machine learning, data storage and processing, and other scenarios. AWS Nitro also includes NVMe and NVMe-oF implementations; it looks like this will become commonplace for all DPUs.
Diamanti is working on a similar project. It develops a line of dedicated hyperconverged servers optimized for running Kubernetes containers, a task they perform better than standard servers. The series includes the D10, D20 and G20 models. In general they differ little from ordinary machines, but they contain two unique components: an NVMe controller and a 40GbE Ethernet controller with Kubernetes CNI and SR-IOV support.
Diamanti’s solutions are interesting in that they use two separate accelerators instead of one, which has its advantages. For example, a 40 Gbps network connection may no longer be enough in the near future, but to bring a Diamanti server up to date it will be enough to replace the network accelerator without touching the NVMe controller board, which handles communication with the disk subsystem.
Also worth mentioning is Fungible, which we told readers about earlier this year. It was one of the first companies to use the term DPU. At the time of its first announcement, in February 2020, Fungible did not have a finished accelerator in hand. But Fungible’s DPU concept is arguably the best: it assumes that in such systems all traffic, from network data and content sent from memory to the CPU to data sent to the GPU, will pass through the DPU.
In Fungible’s view, the “data processor” will become the link that unites all components of a computing system, be they CPUs, GPUs, FPGA accelerators or flash memory arrays. The company plans to use its proprietary low-latency TrueFabric bus as the interconnect. Fungible is expected to present a finished solution this year.
Finally, Pensando, which began a partnership with the well-known storage vendor NetApp in late 2019, is already shipping its Distributed Services Card, the DSC-100. It combines on a single chip and a single board the functions that Diamanti handles with two separate cards. As already noted, this approach has its drawbacks: the entire accelerator has to be replaced even if the “accelerating” part is still perfectly capable and only the network connection needs to be made faster.
At the heart of the DSC-100 is the Capri processor, which on the network side provides a pair of 100GbE ports with a shared packet buffer. A fully programmable data processor works with this buffer, but the chip also contains classic ARM cores as well as “hard” accelerators, for example a cryptographic one. The programmable, hard and ARM parts communicate through a coherent interconnect that is connected to the PCIe controller and the RAM array. Overall, the solution resembles the NVIDIA/Mellanox BlueField-2.
Unfortunately, none of the solutions described has yet become an industry standard. Each has its own advantages and disadvantages and, most importantly, its own incompatible software. This makes integrating DPUs into existing infrastructure a rather difficult process: there is no room for error in choosing a vendor, and on top of that there are separate costs for purchase, installation, maintenance and support.
Only giants like AWS can afford to equip themselves with a DPU perfectly suited to their own tasks. In other words, “data coprocessors” are still niche devices. For them to become truly widespread, a single unified architectural standard is needed, like the one that once gave graphics processors their versatility and cross-compatibility.