no worse than x86-64 / ServerNews


Ampere Altra processors were announced in spring 2020. At the OCP Virtual Summit 2020, GIGABYTE unveiled the MP32-AR0 motherboard with a socket for Ampere processors, and in the fall it announced the new R272-P30 (Mount Snow) server series. Now Ampere has sent two-socket Mount Jade platforms to foreign reviewers, and the first test results of the new CPUs are encouraging.

Ampere Altra chips have up to 80 cores based on the Armv8.2+ architecture (with some features from the v8.3 and v8.4 sets), interconnected by the Arm CoreLink CMN-600 mesh. The cores are backed by an advanced cache system: 64 + 64 KB of L1 and 1 MB of L2 per core, plus up to 32 MB of shared L3. The memory subsystem has eight DDR4-3200 channels (72-bit, 2DPC, up to 4 TB in total).

For peripherals there is a PCIe 4.0 controller with 128 lanes. In the two-socket version, 32 lanes on each CPU are reserved for communication between the processors, which leaves 192 lanes in total; the inter-processor link uses CCIX. It is also worth noting that Ampere follows AMD's approach: the price of a CPU depends only on its core count and frequency, and cheaper models otherwise offer the same functionality as the top-end ones.
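The PCIe arithmetic for the two-socket configuration can be checked with a few lines of Python. This is only an illustrative sketch: the function name `usable_pcie_lanes` and its parameters are hypothetical, and the figures are the ones quoted in this article.

```python
def usable_pcie_lanes(sockets: int,
                      lanes_per_cpu: int = 128,
                      link_lanes_per_cpu: int = 32) -> int:
    """PCIe lanes left for peripherals in a 1- or 2-socket Altra system.

    Assumption (from the article): a single-socket system exposes all
    lanes, while each CPU in a dual-socket system gives up
    `link_lanes_per_cpu` lanes for the CCIX inter-processor link.
    """
    if sockets not in (1, 2):
        raise ValueError("only 1- and 2-socket configurations are modelled")
    if sockets == 1:
        return lanes_per_cpu
    return sockets * (lanes_per_cpu - link_lanes_per_cpu)


print(usable_pcie_lanes(1))  # 128
print(usable_pcie_lanes(2))  # 192
```

So the dual-socket platform gives up 64 lanes overall to the inter-processor link, yet still offers more lanes to peripherals than a single-socket one.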

Unlike the traditional Intel Xeon and AMD EPYC, and even more so IBM POWER9/10, Altra has no simultaneous multithreading (SMT). The developers, however, call this an advantage: dropping SMT let them reduce power consumption, a metric that is extremely important for the high-density server market. Improved security was cited as another reason.

Recommended prices for Ampere Altra Quicksilver processors. AnandTech data

The Mount Jade samples sent to foreign reviewers are equipped with two top-end 80-core Altra Q80-33 processors, which run at 3.3 GHz and have a TDP of 250 W, as well as 512 GB of DDR4-3200. Unlike the single-socket version, the dual-socket platform was developed in collaboration with Wiwynn, a well-known developer and supplier of OCP platforms.

The Ampere processor socket does not have a name of its own yet; by analogy with Intel's naming, it could be called LGA 4926. That is more contacts than second-generation Xeon Scalable has, and even more than Cooper Lake with its 4189. The heatsink mounting mechanism, however, is more reminiscent of AMD SP3: there is a familiar hinged frame secured with five screws. The processor package itself is impressively large: 77 × 66.8 mm.

Comparative size of server processors: Altra is the largest. Photo by ServeTheHome

Interestingly, the Mount Jade reference design uses heatsinks with a rather small contact area, about ¼ of the area of the heat spreader on the processor itself. This gives a rough idea of the actual area of the Altra Quicksilver die, which, recall, is monolithic and manufactured on a 7 nm process. The heatsinks are equipped with a vapor chamber, however, so they should work quite efficiently and cope with a TDP of 250 W.

The rivals of the Ampere Altra Q80-33 are the AMD EPYC 7742 (64 cores, SMT2, 225 W, $6,950) and the Intel Xeon Platinum 8280 (28 cores, SMT2, 205 W, $10,009). Ampere's chip is noticeably cheaper at $4,050. Prices for large customers naturally differ, but the Ampere offer still looks very attractive given its specifications.
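To put the list prices in perspective, here is a minimal sketch that derives price per core and TDP per core from the figures quoted above. The `Cpu` class and its field names are hypothetical helpers invented for this illustration; only the numbers come from the article, and TDP per core is a crude indicator rather than a measured value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cpu:
    name: str
    cores: int
    threads_per_core: int  # 1 = no SMT, 2 = SMT2
    tdp_watts: int
    price_usd: int         # recommended (list) price

    @property
    def price_per_core(self) -> float:
        return self.price_usd / self.cores

    @property
    def watts_per_core(self) -> float:
        return self.tdp_watts / self.cores


# Figures as quoted in the article.
CPUS = [
    Cpu("Ampere Altra Q80-33", 80, 1, 250, 4050),
    Cpu("AMD EPYC 7742", 64, 2, 225, 6950),
    Cpu("Intel Xeon Platinum 8280", 28, 2, 205, 10009),
]

for cpu in CPUS:
    print(f"{cpu.name:26s} "
          f"${cpu.price_per_core:7.2f}/core  "
          f"{cpu.watts_per_core:5.2f} W/core")
```

At list prices this works out to roughly $51 per core for the Altra, against about $109 for the EPYC and about $357 for the Xeon.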

Turbo mode in the understanding of Ampere (left) and in the x86 world

Ampere also takes a different approach to “turbo mode”. In the x86 world, a processor has a guaranteed minimum base frequency that it can exceed; Altra Quicksilver, by contrast, almost always runs at the maximum frequency declared for the model and only occasionally drops below it. At the same time, the new processors try to keep power draw as low as possible.

The new processors did not lead in everything: AnandTech, in particular, noted rather high latencies, both within a socket and between sockets. The latter may be due to the double conversion needed between the AMBA CHI and CCIX protocols. In general, inter-processor communication looks like a rather weak point of Altra: AMD's Infinity Fabric is twice as wide (64 PCIe 4.0 lanes versus 32), while Intel has three UPI links, which provide lower bandwidth but carry no latency penalty.

The popular HPC test NAMD does not yet have compiler support, but even so, Ampere Altra performs well

In the memory bandwidth tests, however, the Altra Q80-33 was the clear leader and demonstrated the advantages of its more flexible model of DRAM resource usage. The laggard here was the Xeon, which has only six memory channels against eight faster ones from AMD and Ampere.
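The channel-count gap translates directly into theoretical peak bandwidth, which a short sketch can estimate. The helper name `peak_bandwidth_gbs` is hypothetical. The calculation assumes 8 bytes of payload per transfer per channel (the remaining 8 bits of a 72-bit channel carry ECC), and the DDR4-2933 speed used for the Xeon is an assumption that is not stated in the article. Measured results are always lower than these ceilings.

```python
def peak_bandwidth_gbs(channels: int,
                       mega_transfers_per_s: int,
                       payload_bytes: int = 8) -> float:
    """Theoretical peak DRAM bandwidth per socket, in decimal GB/s.

    channels * MT/s * bytes-per-transfer gives MB/s; divide by 1000 for GB/s.
    """
    return channels * mega_transfers_per_s * payload_bytes / 1000


# 8 x DDR4-3200, as quoted in the article for the Ampere Altra.
altra = peak_bandwidth_gbs(channels=8, mega_transfers_per_s=3200)

# 6 channels per the article; DDR4-2933 is an assumed speed grade.
xeon = peak_bandwidth_gbs(channels=6, mega_transfers_per_s=2933)

print(f"Altra Q80-33: {altra:.1f} GB/s per socket")
print(f"Xeon 8280:    {xeon:.1f} GB/s per socket (assumed DDR4-2933)")
print(f"Ratio:        {altra / xeon:.2f}x")
```

Under these assumptions an eight-channel DDR4-3200 socket has a ceiling of 204.8 GB/s, so some of the Xeon's deficit is expected from the channel count alone.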

Even in the single-threaded SPECint2017 and SPECfp2017 tests, the new chip proved at least as good as the Xeon Platinum 8280 and in some cases outperformed the AMD EPYC 7742. In some tests, however, the 80-core Arm chip did worse than the 28-core Intel one; the gap is especially noticeable in floating-point workloads.

Single-threaded performance: the leader is still Xeon Scalable

A relatively weak prefetcher is named as a possible culprit, especially since another Arm-based processor, AWS Graviton2, did better in one such test (507.cactuBSSN). In addition, the Xeon can boost to as much as 4 GHz with two active cores, which was bound to affect the results.

Multi-threaded performance in SPEC2017: first place

In the multi-threaded tests, the Xeon came last for obvious reasons, while the Altra Q80-33 took the lead in almost all of them except the aforementioned 507.cactuBSSN. This is an excellent result, given that the competing AMD EPYC 7742 can execute 128 threads. In effect, SPECint has a new absolute leader among dual-socket systems, and in SPECfp the newcomer is practically on par with its “red” rival. It is also worth noting that a single Altra Q80-33 is clearly faster than Graviton2 (64 cores).

The triumph was not repeated in the Java tests. Immature software took its toll, as did the lack of SMT. The test scenarios may also have saturated Altra's inter-core mesh and memory subsystem, but in the critical scenarios the main drawback of the new processor was the lack of multithreading.

It is no accident that IBM, which remains one of the main suppliers of Java solutions, makes active use of SMT4 and even SMT8: JVM-based software thrives under such conditions. Marvell is probably aiming at the same segment with its ThunderX3, whose fate has not yet been determined. Overall, however, the Altra platform still managed to land between the “red” and the “blue” camps.

LLVM compilation: Phoronix version

The new chip did well in compilation tests: several reviewers found that compiling the LLVM suite took about as long as on a system with two EPYC 7742s. Phoronix, however, recorded an anomaly in which Altra lost to the Xeon platform. Even so, the Arm platform showed the best energy efficiency. In compression tests Altra and AMD are mostly on par, and the picture is the same in the MariaDB, nginx, and file-server scenarios.

But the compilation power consumption of the Ampere Altra is still the lowest

Overall, the debut of the Ampere Altra platform can be considered promising. The new processors performed very well: at lower power consumption and a lower recommended price, they delivered performance at or slightly below the level of the AMD EPYC 7742 in the vast majority of tests. The platform does have drawbacks, in particular a not very efficient inter-processor link and the lack of SMT, but fortunately these did not have a fatal impact on overall performance.

New processors perform well in ray tracing

The Wiwynn Mount Jade reference server looks very attractive. The platform proved to be quite mature: it has best-in-class power consumption and can offer users 160 efficient processor cores and up to 8 TB of RAM. The main problem so far, as with the TaiShan ARM platform we tested, is the lack of software optimizations and a developed ecosystem, but that is only a matter of time.

Full reviews of the Ampere Altra processor and the Mount Jade platform are available at AnandTech, ServeTheHome and Phoronix. Finally, we note that NVIDIA, which is in the process of acquiring Arm, only stands to benefit from such platforms and is already porting its software to Arm. In particular, the same Mount Jade, paired with the NVIDIA T4 and the NVIDIA Mellanox BlueField-2 DPU, is used for cloud gaming. In addition, partnerships with GIGABYTE, Inspur and Wiwynn have been announced.