Russian processors Elbrus-16S, Elbrus-12S and Elbrus-2S3 will receive cores of the sixth generation of E2K architecture / ServerNews

At the Elbrus Tech Day event, MCST spoke about its current achievements and its development plans for the Russian Elbrus processor line. The most modern CPU in the line today is the Elbrus-8SV, based on the fifth generation of the E2K (Elbrus 2000) architecture, but three sixth-generation SoCs are due to appear in the coming years: Elbrus-16S, Elbrus-2S3 and Elbrus-12S.

Elbrus-8SV is an evolutionary development of Elbrus-8S. Both chips use a 28 nm process, but optimizations in the 8SV made it possible to raise the clock frequency, which, together with support for wide vector instructions and a more modern memory standard, doubled theoretical peak performance. For programs that do not use SIMD, however, the gain is only proportional to the increase in clock frequency, although they still benefit from the faster memory.

MCST uses these and other processors to develop reference motherboard designs in various form factors, which can be licensed for further customization. Some of the company's partners develop their own motherboards and products based on them. An order for the next batch of 10 thousand Elbrus-8SV chips will soon be placed with TSMC. Overall, a fairly noticeable ecosystem of hardware and software products and solutions has grown up around the existing CPUs.

The next generation of processors will be more diverse. In addition to the 16-core Elbrus-16S, aimed at high-performance server systems, there will be a simpler model, Elbrus-12S, which will appear later than the rest. This 12-core CPU is designed for entry-level servers and workstations, and its main difference from the Elbrus-16S will be the price. Finally, the dual-core Elbrus-2S3 is aimed at mobile systems, including tablet computers.

Elbrus-16S

All chips will be manufactured at TSMC on a 16 nm FinFET process and will be based on the sixth generation of the E2K architecture. Strictly speaking, these are no longer processors but full-fledged SoCs with integrated controllers for various peripherals, and unlike their predecessors they do not need an external southbridge chip to work. In the case of Elbrus-16S, the die area is 618 mm² (25.3 × 24.4 mm), and the die is packaged in an HFCBGA4804 package measuring 63 × 78 mm. The die contains 12 billion transistors, and its power consumption does not exceed 130 W.

A significant part of the architectural changes affected the memory subsystem. In particular, the caches were enlarged to a total of 51 MB: a 32 MB L3 cache shared by all cores, plus, per core, an L2 cache increased to 1 MB, a 128 KB L1 instruction cache and a 64 KB L1 data cache. The memory controller is now eight-channel and supports DDR4-3200 and 2DPC (two modules per channel) configurations, which gives up to 4 TB of RAM per socket with a total bandwidth of up to 200 GB/s.
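The cache total and the bandwidth figure can be checked with simple arithmetic. The sketch below assumes that the L1 and L2 sizes are per core, which is what makes the total come to 51 MB, and that each DDR4 channel has the standard 64-bit data bus (not stated in the article). The function names are illustrative only.

```python
# Sanity check of the Elbrus-16S cache and memory figures quoted above.
# Assumptions: L1/L2 sizes are per core; each DDR4 channel is 64 bits
# (8 bytes) wide, as is standard for DDR4.

CORES = 16
L3_SHARED_MB = 32
L2_PER_CORE_MB = 1
L1I_PER_CORE_KB = 128
L1D_PER_CORE_KB = 64


def total_cache_mb():
    """Sum the shared L3 and the per-core L2 and L1 caches, in MB."""
    l1_mb = CORES * (L1I_PER_CORE_KB + L1D_PER_CORE_KB) / 1024
    return L3_SHARED_MB + CORES * L2_PER_CORE_MB + l1_mb


def peak_bandwidth_gb_s(channels=8, mega_transfers_per_s=3200, bytes_per_transfer=8):
    """Theoretical peak DRAM bandwidth in GB/s (decimal gigabytes)."""
    return channels * mega_transfers_per_s * bytes_per_transfer / 1000


print(total_cache_mb())       # 51.0
print(peak_bandwidth_gb_s())  # 204.8
```

Under these assumptions the result of 204.8 GB/s matches the "up to 200 GB/s" quoted for eight DDR4-3200 channels.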

The first engineering samples of Elbrus-16S, obtained at the end of last year, already reach about 70-80% of the theoretical maximum in the STREAM benchmark. The memory controllers are connected in pairs to four agents (HMUs), which are attached to an internal mesh bus with a bandwidth of 2 TB/s that links memory and cores. The chip can be split into two or four NUMA domains, which is useful for a number of tasks.

One such task is virtualization, which in Elbrus-16S has finally become full-fledged: the new processors support hardware virtualization of almost all important resources, including in the x86 translation mode, which is still present. On CPUs of previous generations containerization can still be used, but MCST is also preparing a paravirtualized kernel and related components, including KVM, QEMU, libvirt and virt-manager.

The microarchitecture of the cores themselves was redesigned, which brought higher speed and new capabilities. In particular, the cores gained new SIMD instructions in addition to the existing ones, FMA support according to the IEEE 754-2008 standard (required by modern C standards), dynamic optimization (of scheduling, which is important for VLIW), a new interrupt controller (required for virtualization) and so on.

The peak theoretical performance of a core is 96 Gflops for single-precision calculations and 48 Gflops for double precision. For the entire CPU, this is 1.5 teraflops and 768 Gflops, respectively. Preliminary tests show a 2-2.5x performance increase over Elbrus-8SV, but it should be remembered that a lot depends on compiler optimizations. The core itself, although it has become more complex, is still simpler than the cores of modern x86-64 processors.
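The whole-chip figures follow directly from the per-core peaks multiplied by the 16 cores. A minimal sketch (the function and dictionary names are illustrative):

```python
# Scale the per-core peak figures quoted above to the 16-core chip.

CORES = 16
PER_CORE_GFLOPS = {"fp32": 96, "fp64": 48}


def chip_peak_gflops(cores=CORES, per_core=PER_CORE_GFLOPS):
    """Return the theoretical peak of the whole chip for each precision."""
    return {precision: cores * gflops for precision, gflops in per_core.items()}


peak = chip_peak_gflops()
print(peak)  # {'fp32': 1536, 'fp64': 768}
```

The single-precision result of 1536 Gflops is the "1.5 teraflops" quoted in the text. These are theoretical peaks, so real workloads will land lower depending on how well the compiler fills the wide instruction words.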

The weak point of the new chips, in our opinion, is the I/O block. The SoC includes four PCIe 3.0 root complexes with a total of 32 lanes. Of these, 8 or 16 lanes can be allocated to an external southbridge if the controllers built into the chip are not enough. The chip itself provides 2 SATA 3.0 ports, 4 USB 3.0/2.0 ports and two multi-ports, each giving either a pair of SATA ports or a pair of Ethernet ports, with a maximum configuration of 10GbE + 2.5GbE.
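To put the lane budget in perspective, the sketch below computes how many lanes remain for other devices in each southbridge configuration, together with their approximate aggregate throughput. The per-lane figure uses the standard PCIe 3.0 parameters (8 GT/s with 128b/130b encoding), which do not come from the article, and the names are illustrative.

```python
# PCIe lane budget of Elbrus-16S for the southbridge options in the text.
# Assumption: standard PCIe 3.0 signalling, 8 GT/s per lane with
# 128b/130b encoding, i.e. roughly 0.985 GB/s per lane per direction.

TOTAL_LANES = 32
PCIE3_GB_S_PER_LANE = 8 * (128 / 130) / 8


def lanes_left(southbridge_lanes):
    """Lanes remaining after reserving 0, 8 or 16 for an external southbridge."""
    if southbridge_lanes not in (0, 8, 16):
        raise ValueError("the article mentions only 8 or 16 southbridge lanes")
    return TOTAL_LANES - southbridge_lanes


for sb in (0, 8, 16):
    free = lanes_left(sb)
    print(f"southbridge x{sb}: {free} lanes free, "
          f"~{free * PCIE3_GB_S_PER_LANE:.1f} GB/s per direction")
```

With a x16 southbridge link, only 16 lanes remain for accelerators, NVMe drives and network cards.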

Another 8 PCIe lanes can be given over to an interprocessor link (IPL) channel, in addition to the two IPL channels that are always present. In a two-socket system, the CPUs can thus be linked by two or three IPLs. However, the speed of one such channel is only 12 Gb/s (engineering samples have so far reached 10 Gb/s), which is much less than that of UPI or Infinity Fabric. Up to four processors can be combined in one system.

Among other things, the chips are equipped with various RAS features to improve operational reliability. Monitoring of the processor and management of its power and cooling have also been improved. All systems based on the new CPUs will now probably be equipped with a BMC (the ASPEED AST2500 and, in the future, the AST2600) running MCST's own firmware based on OpenBMC, with an embedded micro-OS that simplifies initialization and work with the hardware. The reference design of the two-socket 2E16C-SPRC board will appear in the middle of this year, and a single-processor Micro-ATX board by the end of the year.

In 2022, other versions of two- and four-socket systems with Elbrus-16S will appear, as well as one- and two-socket boards for Elbrus-12S. MCST's partners will presumably not sit idle either. Recall that formal completion of Elbrus-16S development is scheduled for the end of this year. For Elbrus-2S3 and Elbrus-12S, exact dates were not announced. And while the 12-core model is most likely very similar to the 16-core one, the junior chip of the series differs noticeably from both.

Elbrus-2S3 has only two sixth-generation cores with a clock frequency of 2 GHz, two DDR4-3200 memory channels and performance of up to 192/96 Gflops for FP32/FP64. It has 16 PCIe 3.0 lanes. It includes an Imagination PowerVR GX6650 3D core (300 Gflops), a number of video decoders and encoders, and a 2D core of MCST's own design. There are four video outputs (two of them HDMI) and support for 4K output. The company will prepare the first Micro-ATX and Mini-ITX boards for this SoC during 2021.
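Because the core count and the clock frequency are both given for Elbrus-2S3, the peak figures can be turned into floating-point operations per core per cycle. This is a rough derivation from the quoted numbers, not a published specification, and the function name is illustrative.

```python
# Derive peak FLOPs per core per clock cycle for Elbrus-2S3 from the
# figures in the text: 2 cores, 2 GHz, 192/96 Gflops for FP32/FP64.


def flops_per_cycle_per_core(chip_gflops, cores, freq_ghz):
    """Peak floating-point operations one core completes per clock cycle."""
    return chip_gflops / (cores * freq_ghz)


fp32_per_cycle = flops_per_cycle_per_core(192, cores=2, freq_ghz=2.0)
fp64_per_cycle = flops_per_cycle_per_core(96, cores=2, freq_ghz=2.0)

print(fp32_per_cycle)  # 48.0
print(fp64_per_cycle)  # 24.0
```

Dividing the chip totals by two cores also gives 96/48 Gflops per core, the same per-core peak quoted for Elbrus-16S.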

The characteristics of the future Elbrus-32S processors have not yet been fully determined, but the approximate outlines of the product are already visible. The CPU should deliver at least 1.5/3/6 teraflops for FP64/FP32/FP16 computing and contain at least 32 cores with a frequency above 2 GHz. Perhaps there will be 64 cores of the seventh generation of E2K. The amount of L3 cache should at least double, and the memory controller may gain support for DDR5 with at least 4 TB per socket. At least two-socket configurations are expected to be supported.

Virtualization and the proprietary secure computing technology may be developed further with the addition of new instructions. The developers already want to provide 64 PCIe 5.0 lanes, which paves the way for the use of CXL 2.0. The built-in controllers may be extended with 100GbE and USB 3.1 or newer, in addition to NVMe, which is now indispensable. Future dies will move to a process node of 7 nm or finer, and their area will grow to 600 mm².