60 cores in SCM, 96 cores in MCM, SMT4 as a gift / ServerNews


The past few days have been rich in processor announcements. IBM unveiled POWER10 with support for OMI DDR5 memory and PCI Express 5.0, and Intel announced Xeon Ice Lake-SP, which finally gained PCIe 4.0 support. Third on the list is Marvell, which used the Hot Chips 32 event to reveal details of ThunderX3, the third generation of its ARM-based ThunderX processors, formally announced this past spring.

ARM processors have conquered the mobile device segment, but a more interesting trend has emerged in the last few years: the architecture underpins a growing number of “large” processors intended for servers. And as practice shows, an architecture once considered “weak” is nothing of the sort.

It competes successfully with x86, especially where high compute density and high energy efficiency are required. AWS Graviton2 and Google's custom processors are proof of this, and Fujitsu's A64FX processor is at the heart of the most powerful supercomputer on the planet, the Japanese Fugaku cluster.

Marvell is one of the companies making a serious push into the server market with the ARM architecture. While the first ThunderX processors, inherited from Cavium, can hardly be called successful, the second generation performed well, and the third appears to be finally ready for mass adoption. Recall that, unlike the in-house designs from AWS and Google, ThunderX3 offers advanced multithreading at the SMT4 level, which is more than x86 provides but less than POWER10.

The maximum core count of ThunderX3 is impressive too. We now know that 96 cores are available only in a dual-die configuration (Marvell's approach here is reminiscent of IBM POWER10, which also comes in two versions). A single die carries up to 60 cores, which is fewer than Graviton2, but, first, not by much, and second, the difference is more than offset by SMT. With SMT4, the chips provide 240 or 384 threads, depending on the version, which is sure to appeal to large cloud providers, since it allows an unprecedented number of VMs to be hosted within a single socket.
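The thread counts follow directly from the core counts quoted above. A minimal sketch of the arithmetic in Python (the function and variable names are illustrative, not Marvell terminology):

```python
# Hardware threads per socket = physical cores x SMT ways.
SMT_WAYS = 4  # SMT4, as stated for ThunderX3

# Core counts quoted in the article for the two package variants.
CORES = {"single-die": 60, "dual-die": 96}


def hardware_threads(cores, smt_ways=SMT_WAYS):
    """Return the number of hardware threads one socket exposes."""
    return cores * smt_ways


for variant, cores in CORES.items():
    print(f"{variant}: {cores} cores -> {hardware_threads(cores)} threads")
```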

Single-threaded performance has not been overlooked either. The company claims a 30% per-thread advantage over ThunderX2. Overall, the third-generation ThunderX should be 2-3 times faster than the second. Architecturally, the processor is based on the ARMv8.3 instruction set, although partial support for ARMv8.4/8.5 is also claimed.

There is no consensus on which is more efficient for connecting cores to one another: ring buses or a single mesh network. Intel prefers the latter, while Marvell opted for the former. As usual, the outer ring hosts the cache (80 MB of L3 per chip), the power management units, and the controllers for memory, PCI Express, and the interprocessor bus (CCPI in this case).

SMT4 support is implemented entirely in hardware. To the operating system, each ThunderX3 thread looks like an ordinary ARM processor. At the same time, implementing such extensive multithreading increased the die area by only 5% compared with a single-threaded design.
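Because each hardware thread appears as a separate CPU, software has to consult the OS topology data to learn which logical CPUs share a physical core. The sketch below reads the standard Linux sysfs file `thread_siblings_list`. How ThunderX3 actually numbers its logical CPUs is not stated in the article, so the sample strings in the comments are hypothetical.

```python
from pathlib import Path


def parse_cpu_list(text):
    """Parse a Linux CPU list such as '0-3' or '0,60,120,180' into CPU ids."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus


def smt_siblings(cpu=0):
    """Return the logical CPUs that share a physical core with `cpu` (Linux only)."""
    path = Path(f"/sys/devices/system/cpu/cpu{cpu}/topology/thread_siblings_list")
    return parse_cpu_list(path.read_text())


try:
    # On an SMT4 part this list would be expected to hold four entries.
    print("logical CPUs sharing a core with cpu0:", smt_siblings(0))
except OSError:
    print("CPU topology information is not available on this system")
```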

Core resources in the new processor are shared dynamically, with arbitration at four points: fetch, where threads with fewer instructions receive higher priority; execution, which works on the same principle; scheduling, which is based on the “age” of the thread; and finally “retirement”, where priority goes to the threads with the largest number of instructions. This multithreading optimization lets Marvell claim almost linear scalability for the new processors, at least within one socket. Depending on the number of instructions per core, the gain ranges from 1.28x to 2.21x.
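The four priority rules can be illustrated with a toy model. This is only a sketch of the policies as described above, not Marvell's implementation: the `ThreadState` fields, the way "age" is measured, and the tie-breaking (lowest thread id wins) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ThreadState:
    tid: int           # hardware thread id within the core (0-3 for SMT4)
    instructions: int  # instructions the thread currently holds (assumed metric)
    age: int           # hypothetical "age" value; larger means older


def pick_fetch(threads):
    """Fetch: the thread with the fewest instructions gets priority."""
    return min(threads, key=lambda t: t.instructions).tid


def pick_execute(threads):
    """Execution: same principle as fetch, per the description."""
    return pick_fetch(threads)


def pick_schedule(threads):
    """Scheduling: priority is based on age; the oldest wins here."""
    return max(threads, key=lambda t: t.age).tid


def pick_retire(threads):
    """Retirement: the thread with the most instructions gets priority."""
    return max(threads, key=lambda t: t.instructions).tid


threads = [
    ThreadState(0, instructions=12, age=3),
    ThreadState(1, instructions=4, age=9),
    ThreadState(2, instructions=20, age=1),
    ThreadState(3, instructions=7, age=5),
]

print("fetch:", pick_fetch(threads))
print("execute:", pick_execute(threads))
print("schedule:", pick_schedule(threads))
print("retire:", pick_retire(threads))
```

Favoring the least-loaded thread at the front of the pipeline and the most-loaded one at retirement tends to keep any single thread from monopolizing shared structures, which is consistent with the scaling claim, though the article gives no further detail.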

The I/O subsystem of the new chips is well developed. The memory controller has 8 channels and supports DDR4-3200. PCI Express is handled by 16 separate controllers supporting version 4.0 of the standard. This should deliver high performance with 16 NVMe drives attached, each with four PCIe lanes.
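For a sense of scale, the peak theoretical bandwidth implied by these figures can be estimated as follows. The sketch assumes a standard 64-bit (8-byte) DDR4 channel and PCIe 4.0 signaling at 16 GT/s per lane with 128b/130b encoding; these constants come from the respective standards, not from Marvell, and sustained throughput will be lower.

```python
def ddr_bandwidth_gbs(channels, mega_transfers, bus_bytes=8):
    """Peak DRAM bandwidth in GB/s: channels x MT/s x bytes per transfer."""
    return channels * mega_transfers * bus_bytes / 1000


def pcie4_bandwidth_gbs(lanes):
    """Peak one-direction PCIe 4.0 bandwidth in GB/s.

    16 GT/s per lane, 128b/130b line encoding, 8 bits per byte.
    """
    return lanes * 16 * (128 / 130) / 8


memory = ddr_bandwidth_gbs(channels=8, mega_transfers=3200)
per_drive = pcie4_bandwidth_gbs(lanes=4)        # one x4 NVMe drive
all_drives = pcie4_bandwidth_gbs(lanes=16 * 4)  # 16 controllers, x4 each

print(f"memory, 8 x DDR4-3200: {memory:.1f} GB/s")
print(f"one NVMe drive (x4):   {per_drive:.2f} GB/s")
print(f"16 drives (64 lanes):  {all_drives:.1f} GB/s")
```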

Marvell also claims fine-grained power management but gives no details, so one can only guess how advanced this ThunderX3 subsystem is. The new processor is manufactured by TSMC on a 7 nm process. The single-die 60-core version will reach the market later this year, while the dual-die version with more cores will begin shipping later, in 2021. The company is already working on ThunderX4; those processors are expected to use a 5 nm process and be released in 2022.