Recently there appeared an article describing a high-speed random number generator (claimed to be the fastest in the world).

It is clear that its author does not read Russian-language articles on the topic, but if he had read, he would not have argued about his world championship generator in speed. The pseudo-random number generation speed is 12 Gigabytes/sec. was achieved a long time ago. This generator is used to generate encryption keys.

The article will further describe an ultra-long pseudorandom number generator. It is based on the cryptographic transformation Russian Roulette.

The generator generates numbers 8192 bytes long at a speed of 12 Gigabytes per second in one stream (on one processor core).

The generator is assembled on the basis of the cryptographic transformation "Russian Roulette", which was specially "tailored" for high-speed random number generation. For this, the encryption key input scheme is excluded from it and, accordingly, the conversion reversibility is removed to form a cryptographic sequence of pseudorandom numbers.
The test sequence (100 Megabytes) obtained as a result of the generator completely passes the statistical tests NIST, DieHard.

Tests were carried out by the method of sequential generation of pseudorandom numbers from the initial monotonic consecutive eight-kilobyte numbers.

The original numbers contain zero bytes, except for the counter, which is located in the lower 8 bytes.

Basic cryptographic conversion scheme


Cryptographic transformation is assembled according to a ring scheme of 16 modules with non-linear properties. Each module performs an operation on 16 bytes, in just one round 256 bytes are modified. The nonlinear module uses half of the ymm register (16 bytes) as a drive.

The combination of drives in a 256-byte number is implemented using the operation of bit summation (XOR) in two versions.

The first transformation can be considered as a modification of the Feistel scheme, where the summation operation does not occur with the previous value of a single converter, but with the current value of a neighboring nonlinear module in a ring structure.

The circuit with the closure of the drives in the ring modifies the values ​​of all positional bits with the propagation of state changes in them. The state change propagation coefficient for one positional bit is always 2.

Here's how it is implemented, the ymm0-ymm7 registers contain 16 drives:

vperm2i128 ymm10,ymm0,ymm0,001h; поменять местами 1 и 16 накопители vpxor ymm0,ymm0,ymm1; vpxor ymm1,ymm1,ymm2; vpxor ymm2,ymm2,ymm3; vpxor ymm3,ymm3,ymm4; vpxor ymm4,ymm4,ymm5; vpxor ymm5,ymm5,ymm6; vpxor ymm6,ymm6,ymm7; vpxor ymm7,ymm7,ymm10; замкнуть кольцо модификаций 

A change in the state of a single bit leads to a guaranteed 50% change in all bits in all drives for 32 rounds of operation of the ring circuit, so the minimum block size is 32 * 256=8 KiloBytes.

The second transformation uses a cumulative scheme. Modification of drives in each round is made so that they summed the contents of 2 to 8 drives.

Here is the code for this conversion:

vpxor ymm1,ymm1,ymm0; vpxor ymm2,ymm2,ymm1; vpxor ymm3,ymm3,ymm2; vpxor ymm4,ymm4,ymm3; vpxor ymm5,ymm5,ymm4; vpxor ymm6,ymm6,ymm5; vpxor ymm7,ymm7,ymm6; vperm2i128 ymm8,ymm7,ymm7,001h; поменять местами 1 и 16 накопители vpxor ymm0,ymm0,ymm8; замкнуть кольцо модификаций. 

A change in the state of a single bit in this circuit leads to a guaranteed 50% change in all bits in all drives in 16 rounds of operation of the ring circuit.

Next, a cryptographic conversion and a procedure for generating a pseudo-random number of size 8 Kilobytes will be described.

Non-linear conversion module


The non-linear conversion module is in many respects similar to that used in the Russian Magma encoder (formerly GOST 28147-89), but it is 16 bytes in size. Non-linear module performs permutation and annular shift operations.

First, a permutation is performed, indexes of 16 byte permutation are specified in the algorithm, and for each module its own permutation is used.

Then a circular shift of the drive fragments is performed, the shifts are used of two types.
The first type uses fixed shear coefficients, for each drive, they are set in the algorithm. Ring shifts are performed on 2 byte fragments of the drive. Inside each drive, all shifts are the same.

The second type of ring shift uses as a shift coefficient a variable value from the drive shifted in the ring of nonlinear transducers by 8 positions.Ring shifts are performed on 4 byte fragments of a 16 byte drive. The shift coefficients are taken from the first byte of each 4 byte words.

Accordingly, in such a module, 4 shifts are used with unpredictable and different shear coefficients.

The cyclic shifts themselves implement the partial inversion function. If there is a single bit in the upper bit of the shifter, then the higher bits are inverted, the number of inverted bits is equal to the number of the shift coefficient.

Here is a variable ring shift module implementation:

vperm2i128 ymm8,ymm0,ymm0,001h; запомнить с обменом местами накопители vpshufb ymm0,ymm0,[r14+0100h]; две перестановки по 16 байт каждая vpand ymm8,ymm8,[r14]; обнуление в двойных словах разрядов 5-31 vpsubb ymm12,ymm15,ymm8; обратное значение для кольцевого сдвига vpsravd ymm8,ymm0,ymm8; сдвиг с размножением знакового разряда vpsllvd ymm0,ymm0,ymm12; сдвиг на освобожденные места vpxor ymm0,ymm0,ymm8; суммирование с частичным инвертированием 

The register ymm15 contains the value 020h in each double word
Address [r14 + 0100h] contains permutation indices
The address [r14] contains eight shift masks with the value 01fh

Here is the implementation of the module with a fixed annular shift:

vpshufb ymm1,ymm1,[r14+0100h]; две перестановки по 16 байт каждая vpsraw ymm14,ymm1,5; сдвиг с размножением знакового разряда vpsllw ymm1,ymm1,16-5; сдвиг на освобожденные места vpxor ymm1,ymm1,ymm14; суммирование с частичным инвертированием vpxor ymm1,ymm1,ymm2; суммирование накопителя с предшествующим 

If we calculate the total number of ring shifts in one round of transformation, then there will be 4 * 16=64 for variable shifts and 8 * 16=96 for fixed ones. Therefore, such crypto conversion was called Russian Roulette .

Generating 4 Kilobyte bytes from 256 Byte


Having a 256-byte generator, getting 8 kilobytes is not an easy task. Form 32 numbers of 256 Bytes in size and say that this 8 Kilobyte number will not work.

Let me explain with a simple example.

Suppose we have several pseudo-random number generators with a bit capacity of 32 bits (a total of 4 billion combinations). No matter how we combine them into a large number, still in 4 billion rounds this large number will be repeated completely...

But that’s not all.

There is another artificiality in the formation of numbers of large length from shorter ones. The values ​​given by any pseudo-random number generator are interconnected by a specific and unambiguous dependence, when subsequent values ​​are always functionally dependent on previous ones.

We need to connect all the pseudo-random components of a large number with dependencies so that the value of any component affects all the others, and not just the subsequent ones.
Simply put, such a number should not have a start or end component, they should all be in an inextricable ring.

Therefore, you need to make a generator with a full capacity of 8 Kilobytes, using a shorter crypto conversion as a basic component. To do this:

  • The original 8 Kbyte buffer is encrypted in 256 Byte blocks
  • encryption performed according to the gamma-feedback scheme
  • such encryption is performed in two passes (around the ring)
  • in the first pass, modules with variable ring shifts and drives attached to non-linear modules are used
  • in the second pass, fixed ring shifts and ring shear drives relative to nonlinear converters are used.

In such a scheme, the state of any bit depends on all other bits in the output number.
Different cryptographic transformations in the encryption passes are needed for a more uniform distribution of the statistical parameters of such a large number.

Such a scheme for generating pseudorandom numbers gives a sequence with statistical parameters that fully satisfy the criteria of randomness according to the NIST version (there is no loss from the given ranges of proportions and the value of P does not fall below 0.002xxx)...

NIST tests are designed for 64 bit generators and they, with default settings, average the parameters for 1,500,000 numbers in a test sequence. In our case, numbers 8K long are generated, and the statistical parameters are averaged for only 1,500 numbers. Nevertheless, even with such a small sample of values, it is possible to successfully pass statistical tests...

This ensures a random number generation rate of 12 Gigabytes/sec.

Implementation in a test program


This rather cumbersome two-pass design is used to form 8 kilobyte pseudorandom numbers with a speed of at least 12 Gigabytes/sec.
Sources and a compiled program for generating pseudo-random numbers is available via links.

The phenomenal speed is achieved due to the large bit depth of the registers (32 bytes) and pipeline (parallel) processing of commands.

The test generator uses the AVX2 commands, so the speed depends on the specific processor model, but it may not work on "old" processors, on "new" processors the speed will be much higher than the declared 12 Gigabytes/sec.

On Intel processors of the sixth generation (testing was performed on them), the speed of the generator was determined primarily by the memory bandwidth, and not the processor frequency.

At the moment, it is the fastest pseudo-random number generator with the largest digit capacity of the generated numbers and the highest quality statistical parameters according to NIST.

Here is a triple record holder...

Well, according to tradition, the authors assign names to all pseudorandom number generators, we will not deviate from this tradition and will call it RU-lette, on behalf of the ancestor, and in Russian it turns out that his name is ru-years...

Source