first look at Chinese ARM’ / ServerNews

128 ARMv8.2-A cores at 2.6 GHz, 512 GB of DDR4-2933 memory and 12 SAS drives in a RAID array make a promising start for exploring a new server platform and architecture that is meant to challenge the hegemony of x86-64.

We had a chance to try out an overseas curiosity: the Huawei TaiShan ARM server. These machines are offered for remote testing as part of Selectel Lab. Because we had no physical access and only limited time, we will not go deep into the details and will instead look at the platform from above, as if it were an ordinary dedicated server somewhere in the cloud. Except that it is anything but ordinary:

  • 2 × HiSilicon Kunpeng 920-7260: 64 cores, 2.6 GHz, L3 cache 64 MB, TDP 195 W
  • 512 GB RAM: 16 × 32 GB DDR4-2933 ECC
  • LSI SAS3508 RAID controller
  • 12 × SAS3 10k HDD 1.2 TB
  • 2 × Huawei TM210 network controllers (LOM, 4 × 1GbE RJ-45)
  • 2 × Huawei IN200 SP580 network controllers (PCIe, 4 × 25GbE SFP28)
  • 2 × PSU 2 kW

Huawei TaiShan Servers

The Huawei TaiShan server series is relatively new only to us. In its home market, China, these servers have long been in active use, including in Huawei's own Huawei Cloud, well before the whole story with the bans began. The breakdown in relations with major US chip manufacturers only spurred the development of the company's own hardware platform. And while continued support for x86 systems remains in doubt, the future of the current ARM platform, to which the company's other products have also been moved, looks good so far, provided that access to the fabs is not cut off.

And it is precisely a platform that we are talking about. First, there are several base variants with one to four sockets, including blades. Configurations for various tasks are built on top of them, from edge systems to high-density servers. All of the company's latest storage systems run on the same base. Second, Huawei has chosen a strategy of working with partners in different markets, so the base variants can be modified to a certain degree to meet local needs. For example, we have already written about the work of NORSI-TRANS. At the current stage, one of the base TaiShan chassis has received its own drive cage, backplane and the corresponding software modifications.

Huawei TaiShan 2280 v2

The machine we tested turned out to be a TaiShan 200 2280 v2 server. The manufacturer calls it a balanced model; it comes in a 2U chassis (3D model). The standard drive cage is available in three versions: 12 × 3.5" SAS/SATA, 24/25 × 2.5" SAS/SATA, or 8 × 2.5" SAS/SATA + 12 × 2.5" NVMe. The RAID controller is third-party: an LSI SAS3x08 with an optional BBU. The list of compatible models also includes the Microsemi PM82xx and LSI 94×0-8i. The standard controller is a mezzanine card, so it does not occupy a PCIe slot. For other expansion cards (InfiniBand, Ethernet, Fibre Channel, HBA, SSD) the chassis provides three IO modules. The maximum is either 8 PCIe 4.0 x8 slots, or 3 PCIe 4.0 x16 slots plus 2 PCIe 4.0 x8 slots.

In general, though, there are many different layout options for the IO modules. Some allow additional 2.5"/3.5" drives to be installed, and there are combo boards as well as plain risers, so it is best to read the documentation to understand the possible combinations. With the LOM cards, called FlexIO here, the situation is simpler. There are two slots for them and two types of modules: the Huawei TM210 has four 1GbE RJ-45 ports, and the TM280 has four 10/25GbE SFP28 ports. A block of iBMC ports sits between the FlexIO slots.

There are, as usual, two 80+ Platinum power supply units (1 + 1) of up to 2 kW each, although the base platform does not consume nearly that much, as we will see later. At the same time, it is completely unclear from the documentation whether the PSUs provide additional power cables for expansion cards. On the other hand, the list of compatible “GPUs” contains only the Atlas 300 accelerator based on the Ascend 310 chip, which needs no more than the 75 W supplied by the PCIe slot.

Overall, the chassis is perhaps a little unusual but quite modern. The question of compatibility with equipment not listed in the documentation remains open; Huawei suggests contacting local representatives for an answer. Again, if we treat TaiShan as a platform, it may well be adapted for specific tasks to support the necessary hardware components.

HiSilicon Kunpeng 920

Kunpeng 920 processors were officially introduced at the beginning of last year, although we first saw them back in 2018 under the name HiSilicon Hi1620. At that time the company claimed that these were the highest-performing ARM chips in the world: a 7-nm TSMC process, from 24 to 64 ARMv8.2-A cores at 2.6 GHz, 1 MB of L3 cache per core, 40 PCIe 4.0 lanes with CCIX support, and 4 or 8 DDR4-2933 ECC channels (up to 2 TB of RAM in total).
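To put the memory figures in perspective, the theoretical peak bandwidth can be estimated from the channel count and the transfer rate. The sketch below is our own back-of-the-envelope arithmetic, not a vendor figure; it assumes the standard 64-bit (8-byte) DDR4 data bus per channel and ignores all real-world losses, and the function name is purely illustrative.

```python
def ddr_peak_gb_per_s(mega_transfers_per_s: float, channels: int,
                      bus_bytes: int = 8) -> float:
    """Theoretical peak bandwidth in GB/s (decimal) for a DDR memory subsystem.

    Assumption: every channel has a 64-bit (8-byte) data bus, as in standard DDR4.
    """
    bytes_per_second = mega_transfers_per_s * 1e6 * bus_bytes * channels
    return bytes_per_second / 1e9


if __name__ == "__main__":
    # DDR4-2933 performs 2933 million transfers per second per channel.
    for channels in (4, 8):
        peak = ddr_peak_gb_per_s(2933, channels)
        print(f"{channels} channels: {peak:.1f} GB/s per socket")
```

For the eight-channel variant this comes to roughly 188 GB/s per socket, which is an upper bound rather than something a benchmark such as Stream would actually reach.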

Seamless two- and four-socket configurations can be built using the Hydra bus (30 GT/s, 240 Gbit/s), of which each CPU has three links. The processor itself is made up of chiplets and also includes two 100GbE units with RoCEv1/v2, 16 SAS 3.0 channels, two SATA III channels and four USB 3.0 ports. All of this comes in a BGA package, which means that if the CPU fails it cannot be replaced on its own; the entire board has to be swapped.

The feature set is very good, and at the time of the announcement it was downright impressive. Today the picture is slightly spoiled by the TDP: the thermal envelope of the top model approaches 200 W. AMD EPYC Rome, released in August, cooled the ardor of those eager to predict the imminent demise of x86-64. However, we will not dig into microarchitectural details here.

We will only note some confusion in the processor naming. The documentation and promotional materials use the indexes 7260, 52×0 and 3210, which denote different numbers of cores and memory channels. Kunpeng 916 CPUs (also known as HiSilicon Hi1616) are available for the TaiShan series as well. The BIOS and some other places, however, use a different numbering. In particular, our Kunpeng 920-7260 is identified there as 920-6426, where the last two pairs of digits indicate the number of cores and the frequency. And nothing is said about dynamic frequency scaling.
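The BIOS-style numbering is easy to decode mechanically. The following sketch is inferred from the single example in this article (920-6426 meaning 64 cores at 2.6 GHz); whether every Kunpeng 920 model follows the same pattern is an assumption, and the function name is ours.

```python
import re
from typing import Tuple


def parse_bios_index(name: str) -> Tuple[int, float]:
    """Split a BIOS-style index such as '920-6426' into (cores, GHz).

    Assumption: the first pair of digits after the dash is the core count
    and the second pair is the frequency in tenths of a GHz.
    """
    match = re.fullmatch(r"920-(\d{2})(\d{2})", name.strip())
    if match is None:
        raise ValueError(f"unrecognized index: {name!r}")
    cores = int(match.group(1))
    ghz = int(match.group(2)) / 10
    return cores, ghz


if __name__ == "__main__":
    cores, ghz = parse_bios_index("920-6426")
    print(f"{cores} cores at {ghz} GHz")
```

Note that the marketing index (7260) does not fit this pattern, so the function deliberately rejects it rather than guessing.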

BIOS and iBMC

At first glance the BIOS may seem a little short on options, but that is mostly a matter of habit: it is simply different. A peculiarity of modern ARM platforms is that they do not need a “classic” BIOS, only UEFI. Arm had to make a special effort here by launching a separate ServerReady program aimed at improving the compatibility of hardware, UEFI, drivers, firmware, software and so on. Simply put, the goal is to bring server platforms to the same level that x86 already enjoys.

Only one parameter had to be changed for the OS installer to boot correctly from the ISO image. While we were at it, we also checked that the power policy was set to Performance. The remaining settings were left at their defaults. It is worth noting separately that NUMA is enabled by default (four domains), but the distribution of domains can be reconfigured.
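Once Linux is running, the NUMA layout chosen in the firmware can be checked from the standard sysfs tree. The sketch below reads /sys/devices/system/node, which is the usual Linux location; the helper names are ours, and we make no claim about the exact CPU ranges the tested machine reports.

```python
import glob
import os
from typing import Dict, List


def parse_cpulist(text: str) -> List[int]:
    """Expand a kernel cpulist string such as '0-3,8' into [0, 1, 2, 3, 8]."""
    cpus: List[int] = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus


def numa_layout(root: str = "/sys/devices/system/node") -> Dict[int, List[int]]:
    """Map each NUMA node number to the list of CPUs that belong to it."""
    layout: Dict[int, List[int]] = {}
    for path in sorted(glob.glob(os.path.join(root, "node[0-9]*"))):
        node = int(os.path.basename(path)[len("node"):])
        with open(os.path.join(path, "cpulist")) as handle:
            layout[node] = parse_cpulist(handle.read())
    return layout


if __name__ == "__main__":
    for node, cpus in sorted(numa_layout().items()):
        print(f"node {node}: {len(cpus)} CPUs")
```

On a machine without NUMA information in sysfs the function simply returns an empty dictionary.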

The iBMC (Intelligent Baseboard Management System) also looks a little unusual, but it turns out to be a very advanced system for remote management and platform monitoring. You can get acquainted with its capabilities using the simulator. It does not implement one hundred percent of the functionality, but it is enough to form a general impression.

Interesting and useful features include LDAP integration, multi-user support, two-factor authentication (certificate + password), good monitoring and incident notification, and data collection that covers screenshots and video recording. Admittedly, monitoring some parameters requires installing drivers and software in the OS itself.

RMCP/RMCP+ is available for management, Redfish is also mentioned in the documentation, and both a classic Java applet and an HTML5 console are provided for iKVM. There is even VNC. The HTML5 console supports ISO/FDD images and the passing of keyboard shortcuts. It may not be the most comfortable experience, but the machine can be controlled remotely.

For compatible devices, the iBMC not only displays parameters but also lets you manage them. In particular, an array on the RAID controller can be built in the web interface (or in the BIOS). In our case this was an LSI SAS3508 with twelve Seagate EXOS 10E2400 SAS3 drives, reported by the iBMC as ST1200MM0009. The drives were assembled into three arrays: two two-disk mirrors for the OS and one RAID-10 for data built from the remaining eight.
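A quick calculation shows how much usable space this layout leaves. The sketch below is our own arithmetic; it assumes two-way mirroring in every array (RAID-1 and RAID-10), the nominal 1.2 TB drive size from the specification list, and no allowance for file system or controller overhead. The array names are made up for illustration.

```python
DISK_TB = 1.2  # nominal capacity of each drive, decimal terabytes


def mirrored_usable_tb(disks: int, disk_tb: float = DISK_TB) -> float:
    """Usable capacity of a RAID-1 or RAID-10 array with two-way mirroring."""
    if disks < 2 or disks % 2 != 0:
        raise ValueError("a two-way mirrored array needs an even number of disks")
    return (disks // 2) * disk_tb


# Hypothetical labels for the three arrays described in the article.
ARRAYS = {"os-mirror-1": 2, "os-mirror-2": 2, "data-raid10": 8}

if __name__ == "__main__":
    usable = {name: mirrored_usable_tb(n) for name, n in ARRAYS.items()}
    for name, size in usable.items():
        print(f"{name}: {size:.1f} TB usable")
    raw = sum(ARRAYS.values()) * DISK_TB
    print(f"total: {sum(usable.values()):.1f} TB usable of {raw:.1f} TB raw")
```

With mirroring everywhere, exactly half of the raw capacity remains available, most of it in the eight-disk data array.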

The most interesting thing in the iBMC, though, is watching the server's power consumption graphs in near real time. Among other things, Huawei emphasizes the energy efficiency of TaiShan compared to other platforms; the iBMC home page even shows how much energy has been saved and by how much the carbon footprint has been reduced. In our case, peak consumption came close to 600 W, while the lowest average was slightly above 360 W.

Linux and software installation

For software development and porting, Huawei relies heavily on partners and on independent third-party developers, whom the company is prepared to support financially. Among Russian distributions, Alt Linux, Astra Linux and RED OS are working on TaiShan support. Huawei itself has prepared the free openEuler distribution, based on the commercial EulerOS, which in turn is based on CentOS/RHEL. Apart from Linux-based ones, no other operating systems or hypervisors are supported.

The first release, openEuler 20.03 LTS, came out in March, so when looking for answers to problems or questions you will often find them only on Chinese forums or in the Huawei community itself. There is no formal list of compatible distributions anywhere, but the documentation contains instructions for Ubuntu 18.04 (with an HWE kernel), SLES 15 and CentOS 7.6. That said, AArch64 builds of popular distributions have been available for several years.

For the test we chose openEuler 20.03 as the official distribution, which has a number of optimizations for the Kunpeng 920 out of the box, and Ubuntu 19.10, later upgraded in place to 20.04 (released after testing had started); both Ubuntu versions have fairly recent kernels.

Installing Ubuntu 19.10 from an ISO image through iKVM went smoothly. The network interfaces were detected correctly, so the netinstall version would probably have sufficed. Nothing extra had to be configured to get working. The subsequent upgrade to 20.04 by standard means was also successful, apart from a few lost symlinks and parameters that were fixed with a couple of commands, and this is hardly the fault of the platform. Unlike openEuler, the Ubuntu kernel knew nothing about the CPU model name, which, however, did not affect performance. All the OSes reported a strange L3 cache size.

The openEuler installation also went without any particular adventures. Afterwards we had to add the repositories to the settings by hand, since none were configured by default, and the documentation actually recommends using the installation DVD ISO image as the main source. Only the official project repositories were used. And this is where the first, and generally quite expected, slight disappointment awaited us.

With its EPOL repository, the distribution is clearly trying to catch up with EPEL in the number of ready-made packages, but so far it has not quite succeeded. For some packages an alternative can be found, for others it cannot. Most of the basic software does seem to be there at first glance, but some things still have to be built by hand.

Testing

For testing we used Phoronix Test Suite 9.6 (PTS). As usual, ext4 was used for the OS partition, while /var, from which all tests were run, resided on a second partition with XFS. The mount options were left unchanged, but on openEuler, at the insistence of PTS, the low_latency option of the BFQ I/O scheduler was disabled, since it prioritizes latency at the expense of throughput, which is not what we need here. Ubuntu 19.10 and 20.04 used the stock generic kernels of the 5.3 and 5.4 branches respectively, and their bundled compilers were GCC 9.2.1 and 9.3.0.
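For reference, BFQ exposes this tunable through sysfs. The sketch below shows one way the option could be switched off; it is not necessarily how it was done during our testing. The device name is a placeholder, root privileges are required, and the paths follow the standard Linux block layer layout.

```python
def active_scheduler(scheduler_file_text: str) -> str:
    """Return the scheduler shown in brackets, e.g. 'bfq' for 'none [bfq] kyber'."""
    for token in scheduler_file_text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    raise ValueError("no active scheduler found")


def low_latency_path(device: str, sysfs_root: str = "/sys/block") -> str:
    """Path of the BFQ low_latency tunable for a block device."""
    return f"{sysfs_root}/{device}/queue/iosched/low_latency"


def disable_bfq_low_latency(device: str, sysfs_root: str = "/sys/block") -> bool:
    """Write 0 to low_latency if BFQ is the active scheduler; needs root.

    Returns False without touching anything when another scheduler is in use.
    """
    with open(f"{sysfs_root}/{device}/queue/scheduler") as handle:
        if active_scheduler(handle.read()) != "bfq":
            return False
    with open(low_latency_path(device, sysfs_root), "w") as handle:
        handle.write("0\n")
    return True


if __name__ == "__main__":
    # "sda" is a hypothetical device name; substitute the real one.
    print(low_latency_path("sda"))
```

The setting does not survive a reboot, so in practice it would be applied from a udev rule or a startup script.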

For openEuler, the bundle consisted of its own 4.19 kernel and GCC 7.3.0. The stated features of the release include optimizations of the libraries, of the kernel itself and of OpenJDK. Nothing is said specifically about proprietary patches to the stock compiler, although that would make for an ideal combination, since PTS builds all the software locally from source. As a basis we took the set of tests used last year for AMD EPYC 7002.

Naturally, the final set of tests ended up different, and for a variety of reasons. Some software packages are simply not intended for AArch64, and those are the easiest case: there is no point in even trying to build them. With others, things can be more complicated. For some of the software it quickly became clear that it relies heavily on libraries and optimizations that exist only for x86-64, so it simply would not build.

Other software may have no such tight coupling and still run into build problems. In one place literally a single #define is missing; in another, the wrong compiler flag is specified. In a small project this is easily fixed by hand. In a large one, where even the several thousand-line Makefiles are themselves generated by scripts, it is a dubious pleasure. And it is not only compiler flags that differ: other utilities may also reject some parameters.

Finally, the most unpleasant case is when the software builds but does not work, producing no results or crashing with an error, especially when this does not happen right after launch. And if it seems that interpreted languages should be immune, then alas, some modules or libraries may simply be missing. We ran into all of these issues. openEuler turned out to be the worst in terms of building programs, partly because a number of packages are missing from its repositories.

On the other hand, openEuler ended up leading in 61% of the tests that did run. True, the average difference is not that large (5% faster than Ubuntu 19.10 and 1% faster than 20.04), and the distribution is uneven. For example, the Canonical product led in storage and DBMS workloads, while Huawei's distribution did better in compilation tests, memory workloads and Java. There were also outright dips and spikes. In particular, openEuler was almost twice as slow in Stream, yet twice as fast in CacheBench writes. Rendering in Java is three times faster, but the native render engines are almost identical in speed.
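The gap between a large share of wins and a small average advantage is easy to illustrate. The sketch below is only an illustration: the article does not state how the averages were computed, so the geometric mean of per-test speed ratios is our assumption, and the ratios in the example are invented rather than taken from the results.

```python
import math
from typing import Sequence


def geometric_mean(ratios: Sequence[float]) -> float:
    """Geometric mean of speed ratios (a ratio above 1.0 means the first system won)."""
    if not ratios:
        raise ValueError("no ratios given")
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


def win_share(ratios: Sequence[float]) -> float:
    """Fraction of tests in which the ratio is strictly above 1.0."""
    if not ratios:
        raise ValueError("no ratios given")
    return sum(1 for r in ratios if r > 1.0) / len(ratios)


if __name__ == "__main__":
    # Hypothetical ratios: one 2x win, one 2x loss, three narrow wins.
    example = [2.0, 0.5, 1.02, 1.03, 1.01]
    print(f"wins: {win_share(example):.0%}")
    print(f"geometric mean: {geometric_mean(example):.3f}")
```

A twofold win and a twofold loss cancel each other out exactly under the geometric mean, which is why a few outliers in either direction barely move the overall figure.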

Overall, working in Ubuntu turned out to be more pleasant and easier, if only because we did not have to build or hunt for binaries of the auxiliary utilities that PTS itself requires. Full results are presented in this file. We note separately that they should be treated strictly as preliminary. In a year it would be interesting to see how the openEuler package base has grown and which development tools have been added. Alas, this time we had to leave aside virtualization and containerization, as well as the A-Tune intelligent system profiling tool.

Things turned out to be curious with KAE, a hardware accelerator whose modules and libraries are included in openEuler. KAE supports the Chinese SM3/SM4 algorithms and the more common RSA/AES, as well as compression and decompression and some other operations. Proper documentation is available only in Chinese, while the English version covers encryption operations only. Using it requires a license, which we did not have, yet AES acceleration appears to work even without one. In any case, with the parameters from the examples and block sizes up to 256 KB, enabling the KAE engine yields an almost fourfold difference in speed.

To get at least a rough idea of how the Kunpeng 920 compares with AMD EPYC and Intel Xeon, let us compare our tests with this result. There were very few matching test packages here too, so no final conclusions should be drawn from them. From the whole pool we selected several processors that are closest in the number of threads, rather than just cores, since Kunpeng does not support SMT. As a result, we got the following set:

  • 2 × AMD EPYC 7601 (32/64, 2.2 / 3.2 GHz, L3 cache 64 MB, TDP 180 W)
  • 2 × AMD EPYC 7502 (32/64, 2.5 / 3.35 GHz, L3 cache 128 MB, TDP 180 W)
  • AMD EPYC 7742 (64/128, 2.25 / 3.4 GHz, L3 cache 256 MB, TDP 225 W)
  • 2 × Intel Xeon Platinum 8280 (28/56, 2.7 / 4.0 GHz, L3 cache 38.5 MB, TDP 205 W)
  • 2 × HiSilicon Kunpeng 920-7260 (64, 2.6 GHz, L3 cache 64 MB, TDP 195 W)
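The thread counts behind this selection can be verified with a few lines. The core and thread figures below are taken from the list above; the table layout and function name are ours.

```python
# (system, sockets, cores per socket, hardware threads per core)
SYSTEMS = [
    ("2 x AMD EPYC 7601", 2, 32, 2),
    ("2 x AMD EPYC 7502", 2, 32, 2),
    ("AMD EPYC 7742", 1, 64, 2),
    ("2 x Intel Xeon Platinum 8280", 2, 28, 2),
    ("2 x HiSilicon Kunpeng 920-7260", 2, 64, 1),  # no SMT on Kunpeng
]


def total_threads(sockets: int, cores: int, threads_per_core: int) -> int:
    """Total number of hardware threads in a system."""
    return sockets * cores * threads_per_core


if __name__ == "__main__":
    for name, sockets, cores, tpc in SYSTEMS:
        print(f"{name}: {total_threads(sockets, cores, tpc)} threads")
```

All of the AMD configurations and the Kunpeng pair come out at 128 threads, with the Xeon pair slightly behind at 112.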

There is also a small comparison with our AMD EPYC 7742 results from last year, which are similar in spirit to the current data in that not all kernel and compiler patches were ready at the time of testing. Naturally, EPYC turned out to be more powerful: Kunpeng running openEuler was on average a quarter slower. Full comparison results are presented in this file.

Conclusion

Overall, after this brief acquaintance, we liked the platform. All that remains is a sma... no, not a small matter at all, but the mass optimization of software and the maturing of development tools. For openEuler in particular, the plans include further optimization of OpenJDK, a move to GCC 9.3.0 and work on LLVM. The developers promise to contribute all of this work upstream to the main branches of the projects. The release and wide availability of inexpensive motherboards with simpler processors should help here to some extent.

One very important unknown remains: the price. And that means not only the price of the hardware but, at least for now, the price of porting software. If open sources and the manufacturer's promises are to be believed, TaiShan is cheaper than comparable mass-market x86-64 2S solutions (the situation is different with 4S systems and blades). For a single server this is not critical, but at the scale of a rack and beyond it may already be interesting. Potential savings, in one form or another, are in our opinion the only reasonable argument in favor of ARM. Quasi-political stories about import substitution or security do not count.

Huawei was literally forced to switch to another architecture, but can it pull the rest of the world along with it? And when will SMB customers walk into a store and seriously ponder whether to buy an ARM server? Not any time soon. For now, the obvious strategy is to adapt to the specific projects, tasks and workloads of interested customers, and to develop the community.

P.S.: this article was prepared before the introduction of the latest package of restrictions, which may affect TSMC, where HiSilicon Kunpeng processors are manufactured.