Hello everyone!


The recent article on Habré once again showed a steady interest in the game of Life in particular and in all kinds of optimizations in general. The article and the comments on it, especially the curiosity about GPU computing, inspired me to share my own research in this area. Looking ahead, I will say that the story focuses on GPU calculations, bit magic, multithreading and huge fields for the game of Life, on the order of a billion cells.



About me


A few words about myself and the project. For several years now my hobby and pet project has been writing the game of Life, and through it I explore topics that interest me. At first I built it with the glut library and bolted on a multi-user mode, inspired by a peer-to-peer architecture. About two years ago I decided to start over from scratch, this time using Qt and GPU computing.

Scale


The idea of the new project was to make a cross-platform application, primarily for mobile devices, in which users can dive into the game of Life, get acquainted with a wide variety of patterns, and observe the bizarre shapes and combinations of this game, choosing a speed from 1 to 100 ms per step and setting the size of the playing field up to 32'768 x 32'768, i.e. 1'073'741'824 cells.

Just in case, let me briefly recall the game of Life. The game takes place on a plane divided into cells. The cells are square, so each cell has 8 neighbours. Each cell can be either alive or dead, that is, empty. During the game the user fills cells with life, placing on the field the combinations of cells that interest him, called patterns.

Then the process develops step by step according to the following rules (a minimal code sketch of them is given after the list):

  • life is born in an empty cell that has exactly 3 living neighbours
  • a living cell with 2 or 3 living neighbours survives to the next step
  • a living cell with fewer than 2 or more than 3 living neighbours dies
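The promised sketch of the transition function for a single cell (illustrative, not the project's actual code); neighbours is the number of living cells among the eight neighbours:

// A minimal sketch of the rules for one cell; `alive` is the current state,
// `neighbours` is the number of living cells among the 8 neighbours.
bool nextState(bool alive, int neighbours)
{
    if (alive)
        return neighbours == 2 || neighbours == 3;   // survival
    return neighbours == 3;                          // birth
}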

A discerning reader will guess that a naive implementation cannot cope with such a volume of computation, which is why all kinds of optimizations were applied.

In my implementation the field wraps around, that is, after the 32'767th cell the 0th cell follows again. Thus moving patterns can travel around the field and return to the point where they were originally placed. A funny effect happens with the "glider gun" pattern: the gliders it releases eventually crash into the gun itself and destroy it, a kind of suicide in the world of cellular automata:
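As an illustration, wrap-around indexing can be expressed roughly like this (a sketch assuming a field side of 32'768; the Metal shader shown later uses the same idea in its loopPos helper):

#include <cstdint>

constexpr int32_t SIDE = 32768;                  // side of the playing field

// Maps any coordinate, including negative ones, back onto the field.
inline int32_t wrap(int32_t coord)
{
    return (coord % SIDE + SIDE) % SIDE;
}
// wrap(-1) == 32767, wrap(32768) == 0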

Gosper glider gun


Architecture


As for the implementation, first of all I want to note that each cell of the playing field occupies only 1 bit in memory. So when the size of the playing field is chosen, a buffer of n^2/8 bytes is allocated. This improves data locality, reduces memory consumption and, most importantly, makes it possible to apply even more ingenious optimizations, which are discussed below. The playing field is always square and its side is always a power of two, in particular so that it divides by 8 without a remainder.
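A rough sketch of such bit-per-cell storage on the CPU side might look like this (names and layout are illustrative, not the project's actual classes):

#include <cstdint>
#include <vector>

struct Field
{
    uint32_t n;                      // side of the square field, a power of two
    std::vector<uint8_t> bits;       // n * n / 8 bytes, one bit per cell

    explicit Field(uint32_t side) : n(side), bits(static_cast<size_t>(side) * side / 8, 0) {}

    bool get(uint32_t x, uint32_t y) const
    {
        const size_t i = static_cast<size_t>(y) * n + x;     // linear bit index
        return (bits[i >> 3] >> (i & 7)) & 1;                // byte i/8, bit i%8
    }

    void set(uint32_t x, uint32_t y, bool alive)
    {
        const size_t i = static_cast<size_t>(y) * n + x;
        if (alive) bits[i >> 3] |= static_cast<uint8_t>(1u << (i & 7));
        else       bits[i >> 3] &= static_cast<uint8_t>(~(1u << (i & 7)));
    }
};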

Architecturally, two layers are responsible for the computations:

  • the low level is platform dependent; currently there is an implementation on the Metal API, Apple's graphics API that gives access to the GPU on Apple devices, as well as a fallback implementation on the CPU. At one time there was also an OpenCL version. An implementation on the Vulkan API is planned in order to run on Android
  • the high level is cross-platform; it is essentially a class built around the template method pattern, delegating the implementation of the required methods to the low level (a rough sketch of this split is given after the list)
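The split between the two layers can be sketched like this (hypothetical names, not the actual classes from the repository):

#include <cstdint>
#include <vector>

// Low level: platform-dependent back end (Metal, CPU fallback, later Vulkan).
class LowLevel
{
public:
    virtual ~LowLevel() = default;
    // Computes one step: reads the constant input buffer, fills the output buffer.
    virtual void computeStep(const std::vector<uint8_t>& input,
                             std::vector<uint8_t>& output) = 0;
};

// High level: cross-platform front end using the template method pattern.
class HighLevel
{
public:
    explicit HighLevel(LowLevel& backend) : m_backend(backend) {}

    // The sequence of a game step is fixed here; the actual work is delegated.
    void step(std::vector<uint8_t>& current, std::vector<uint8_t>& next)
    {
        m_backend.computeStep(current, next);
        current.swap(next);              // the new state becomes the next step's input
    }

private:
    LowLevel& m_backend;
};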

Low Level


The task of the low level is to actually compute the state of the playing field, and it is organized as follows. Two buffers of n^2/8 bytes are kept in memory. The first buffer serves as the input: the current state of the playing field. The second buffer serves as the output, into which the new state of the playing field is written during the computation. When the computation is finished, it is enough to swap the buffers and the next step of the game is ready. This architecture allows the computation to be parallelized because the input never changes: each thread can process its cells of the playing field independently.
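On the CPU this scheme can be sketched as follows (a simplified illustration; the per-byte kernel is passed in as a parameter rather than taken from the project's code):

#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// perByte: any function that computes one output byte (8 cells) from the constant input buffer.
void step(const std::vector<uint8_t>& input, std::vector<uint8_t>& output,
          unsigned threadCount,
          const std::function<uint8_t(const std::vector<uint8_t>&, size_t)>& perByte)
{
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threadCount; ++t)
    {
        pool.emplace_back([&, t]
        {
            // Each thread reads the shared, constant input and writes its own bytes of the output.
            for (size_t i = t; i < input.size(); i += threadCount)
                output[i] = perByte(input, i);
        });
    }
    for (auto& worker : pool)
        worker.join();
    // After the step the caller simply swaps the input and output buffers.
}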

It is this parallelism that lets the GPU approach achieve maximum efficiency. The computation is also parallelized on the CPU, but with a far smaller number of threads and lower efficiency, which is reflected in the benchmark section below. The algorithm itself is identical in all implementations and is as follows.

Let us call the computation of one byte of the buffer storing the playing field, that is, of eight cells, the atomic operation. This computation is performed by a single thread and requires information about the eight neighbouring bytes. Let us schematically depict one byte as follows, where each dot is a bit:

[........] 

A fragment of the playing field with the byte of interest and its eight neighbouring bytes then looks like this:

[........][........][........]
[........][********][........]
[........][........][........]

From the figure one can guess that all the neighbouring bytes can be packed into a single uint64_t (let's call it neighbors), while the processed byte itself is stored in a uint8_t (let's call it self).
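A sketch of that packing (illustrative only; the Metal shader below uses two 32-bit values instead, because earlier Metal versions had no 64-bit integer):

#include <cstdint>

// Packs the eight neighbouring bytes into a single 64-bit value.
uint64_t packNeighbours(uint8_t n, uint8_t e, uint8_t s, uint8_t w,
                        uint8_t nw, uint8_t ne, uint8_t se, uint8_t sw)
{
    return  static_cast<uint64_t>(n)  <<  0 | static_cast<uint64_t>(e)  <<  8 |
            static_cast<uint64_t>(s)  << 16 | static_cast<uint64_t>(w)  << 24 |
            static_cast<uint64_t>(nw) << 32 | static_cast<uint64_t>(ne) << 40 |
            static_cast<uint64_t>(se) << 48 | static_cast<uint64_t>(sw) << 56;
}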

Consider, for example, the computation of the 0th bit of the target byte. The bit of interest is marked with an asterisk, and the bits adjacent to it with zeros:

[.......0][00......][........]
[.......0][*0......][........]
[.......0][00......][........]

A more complex example. Here asterisks denote the 0th, 3rd and 7th bits of the target byte, and the corresponding digits mark their adjacent bits:

[.......0][00333.77][7.......]
[.......0][*03*3.7*][7.......]
[.......0][00333.77][7.......]

I think some readers have already guessed what the magic is. With this information at hand, it only remains to compose bit masks for each bit of the target byte and apply them to neighbors and self respectively. The result is two values; the total number of set bits in them gives the number of living neighbours, which is then interpreted according to the rules of Life: 2 or 3 to keep a living cell alive, exactly 3 to give birth to life in an empty cell. Otherwise the cell remains, or becomes, empty.
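For the CPU side this idea can be sketched as follows (a simplified illustration for the "middle" bits 1..6 of the target byte, whose neighbours lie only in the bytes directly above and below and in the byte itself; bits 0 and 7 additionally need the diagonal bytes, as in the shader below):

#include <bit>        // std::popcount, C++20
#include <cstdint>

// n, s  - the bytes directly above and below the target byte
// self  - the target byte itself
// bit   - index of the cell inside the byte, 1..6
int countNeighbours(uint8_t self, uint8_t n, uint8_t s, unsigned bit)
{
    const uint32_t n_s      = static_cast<uint32_t>(n) | (static_cast<uint32_t>(s) << 8);
    const uint32_t maskNS   = 0x0707u << (bit - 1);   // three bits above and three below
    const uint32_t maskSelf = 0x05u   << (bit - 1);   // the two horizontal neighbours
    return std::popcount(n_s & maskNS) + std::popcount(static_cast<uint32_t>(self) & maskSelf);
}

bool nextBit(uint8_t self, uint8_t n, uint8_t s, unsigned bit)
{
    const int  neighbours = countNeighbours(self, n, s, bit);
    const bool alive      = (self >> bit) & 1;
    return alive ? (neighbours == 2 || neighbours == 3) : (neighbours == 3);
}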

The calculations described above are performed at the low level on every step of the game, and the field is always processed in its entirety, regardless of how full it is and without memoization tricks a la Hashlife.

Once the entire output buffer has been filled, the playing field is considered computed.

Metal shader code for processing 1 byte
//The C++-like syntax is immediately noticeable
#include <metal_stdlib>
#include <metal_integer>
using namespace metal;

//Helper functions for computing a cell's position
ushort2 pos(uint id) { return ushort2(id % WIDTH, id / HEIGHT); }
uint idx(ushort2 pos) { return pos.x + pos.y * HEIGHT; }
ushort2 loopPos(short x, short y) { return ushort2((x + WIDTH) % WIDTH, (y + HEIGHT) % HEIGHT); }

//Bit masks for computing the neighbours of the bit of interest
template<uint Bit>
struct Mask {
    constant constexpr static uint c_n_e_s_w = 0x70007 << (Bit - 1);
    constant constexpr static uint c_nw_ne_se_sw = 0x0;
    constant constexpr static uint c_self = 0x5 << (Bit - 1);
};

template<>
struct Mask<0> {
    constant constexpr static uint c_n_e_s_w = 0x80030003;
    constant constexpr static uint c_nw_ne_se_sw = 0x80000080;
    constant constexpr static uint c_self = 0x2;
};

template<>
struct Mask<7> {
    constant constexpr static uint c_n_e_s_w = 0xC001C0;
    constant constexpr static uint c_nw_ne_se_sw = 0x10100;
    constant constexpr static uint c_self = 0x40;
};

//For the given bit the function computes the cell's state from its neighbours,
//applying the masks that correspond to that bit
template<uint Bit>
uint isAlive(uint self, uint n_e_s_w, uint nw_ne_se_sw)
{
    /*
    [.......0][00333.77][7.......]
    [.......0][*03*3.7*][7.......]
    [.......0][00333.77][7.......]
    */
    //Up to a certain version Metal had no 64-bit integer, so two masks are composed
    uint neighbours = popcount(Mask<Bit>::c_n_e_s_w & n_e_s_w)
                    + popcount(Mask<Bit>::c_nw_ne_se_sw & nw_ne_se_sw)
                    + popcount(Mask<Bit>::c_self & self);
    return static_cast<uint>((self >> Bit & 1) == 0 ? neighbours == 3 : neighbours == 2 || neighbours == 3) << Bit;
}

//Metal even supports template magic
template<uint Bit>
uint calculateLife(uint self, uint n_e_s_w, uint nw_ne_se_sw)
{
    return isAlive<Bit>(self, n_e_s_w, nw_ne_se_sw) | calculateLife<Bit - 1>(self, n_e_s_w, nw_ne_se_sw);
}

template<>
uint calculateLife<0>(uint self, uint n_e_s_w, uint nw_ne_se_sw)
{
    return isAlive<0>(self, n_e_s_w, nw_ne_se_sw);
}

//The main compute shader function. It receives the two buffers discussed above,
//the constant input and the output, as well as id, the coordinate of the target byte
kernel void lifeStep(constant uchar* input [[buffer(0)]],
                     device uchar* output [[buffer(1)]],
                     uint id [[thread_position_in_grid]])
{
    ushort2 gid = pos(id * 8);

    //Compute the neighbouring bytes
    uint nw = idx(loopPos(gid.x - 8, gid.y + 1));
    uint n  = idx(loopPos(gid.x    , gid.y + 1));
    uint ne = idx(loopPos(gid.x + 8, gid.y + 1));
    uint e  = idx(loopPos(gid.x + 8, gid.y    ));
    uint se = idx(loopPos(gid.x + 8, gid.y - 1));
    uint s  = idx(loopPos(gid.x    , gid.y - 1));
    uint sw = idx(loopPos(gid.x - 8, gid.y - 1));
    uint w  = idx(loopPos(gid.x - 8, gid.y    ));

    //Fetch the byte with the target bits
    uint self = static_cast<uint>(input[id]);

    //Prepare the bit masks with the neighbours
    //north_east_south_west
    uint n_e_s_w = static_cast<uint>(input[n >> 3]) << 0 * 8
                 | static_cast<uint>(input[e >> 3]) << 1 * 8
                 | static_cast<uint>(input[s >> 3]) << 2 * 8
                 | static_cast<uint>(input[w >> 3]) << 3 * 8;

    //north-west_north-east_south-east_south-west
    uint nw_ne_se_sw = static_cast<uint>(input[nw >> 3]) << 0 * 8
                     | static_cast<uint>(input[ne >> 3]) << 1 * 8
                     | static_cast<uint>(input[se >> 3]) << 2 * 8
                     | static_cast<uint>(input[sw >> 3]) << 3 * 8;

    //This line computes all 8 cells of the processed byte
    output[id] = static_cast<uchar>(calculateLife<7>(self, n_e_s_w, nw_ne_se_sw));
};


High Level


After receiving a signal from the low level that the computation is finished, the high level copies the entire buffer in order to ensure thread-safe reading of the data for rendering and possible saving of the game state on exit.

The cells are drawn by Qt, namely by filling pixels of a QImage. The interface is written in QML. Pixels are filled only for the small area of the playing field that is visible to the player. This avoids wasting resources on unnecessary rendering.
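A rough sketch of such rendering might look like this (assuming an ARGB32 QImage sized to the visible area and the bit-per-cell buffer described above; an illustration, not the project's actual drawing code):

#include <QImage>
#include <cstdint>
#include <vector>

// buffer: n*n/8 bytes, one bit per cell; n: side of the playing field.
// The image covers only the rectangle of the field visible to the player.
void renderVisible(const std::vector<uint8_t>& buffer, uint32_t n, QImage& image,
                   uint32_t originX, uint32_t originY)
{
    for (int y = 0; y < image.height(); ++y)
    {
        for (int x = 0; x < image.width(); ++x)
        {
            const size_t i = static_cast<size_t>(originY + y) * n + (originX + x);
            const bool alive = (buffer[i >> 3] >> (i & 7)) & 1;
            image.setPixel(x, y, alive ? 0xFF000000 : 0xFFFFFFFF);
        }
    }
}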

The computations at the low and high levels run in separate threads. The high level also processes the input of new cells that make up the pattern selected by the user. Interestingly, this is the only logic that uses a mutex; all other synchronization, both within the two levels and between them, is done with atomic variables.
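A minimal sketch of such lock-free signalling between the compute thread and the rendering thread (illustrative only, with hypothetical names):

#include <atomic>

std::atomic<bool> stepReady{false};

// Compute thread: called once the output buffer has been filled.
void onStepFinished()
{
    stepReady.store(true, std::memory_order_release);
}

// Rendering thread: polled before drawing; returns true exactly once per finished step.
bool tryConsumeStep()
{
    return stepReady.exchange(false, std::memory_order_acquire);
}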

Benchmarks


The tables below show timing measurements of the game on different machines, depending on the size of the playing field. All values are in milliseconds. During the tests the playing field was filled to 50% with life using a random number generator.

The tables contain timings of the low and high levels, for which the minimum values are taken. For a more general picture, the time to compute 100 game steps and the average time per step are also given. The latter is the most indicative, since it includes rendering and other related operations.

MacBook Pro 2014
Processor 2.6 GHz Dual-Core Intel Core i5
Memory 8 GB 1600 MHz DDR3
Graphics Intel Iris 1536 MB

GPU implementation
Field side             1024   2048   4096   8192  16384  32768
Low level (min)           0      0      2      9     43    170
High level (min)          0      0      0      1     12     55
100 steps               293    446   1271   2700   8603  54287
One step time (avg)       3      4     13     27     86    542

CPU implementation
Field side             1024   2048   4096   8192  16384  32768
Low level (min)           3     25    117    552   2077  21388
High level (min)          0      0      0      1      4     51
100 steps               944   3894  15217  65149 231260      -
One step time (avg)       9     39    152    651   2312      -

MacBook Pro 2017
Processor 2.8 GHz Intel Core i7
Memory 16 GB 2133 MHz LPDDR3
Graphics Intel HD Graphics 630 1536 MB

GPU implementation
Field side             1024   2048   4096   8192  16384  32768
Low level (min)           0      0      0      2      8     38
High level (min)          0      0      0      0      3     13
100 steps                35     67    163    450   1451   5886
One step time (avg)       0      1      2      5     15     59

CPU implementation
Field side             1024   2048   4096   8192  16384  32768
Low level (min)           1      7     33    136    699   2475
High level (min)          0      0      0      0      3     18
100 steps               434   1283   4262  18377  79656 264711
One step time (avg)       4     13     43    184    797   2647

It can be seen that on the old MacBook the processor cannot cope with a huge playing field at all and the application becomes unusable. Even the processor of the new MacBook barely handles this amount of computation. The GPU, however, significantly outperforms the CPU on both the old and the new hardware. With the GPU, the new MacBook provides a completely comfortable user experience even on vast playing fields.

Summary


At the moment the game is at the interface development stage and is geared primarily towards mobile devices, so the controls on desktop are still rough.

Unlike the original article, I did not aim to benchmark individual optimizations, so the two articles cannot be compared one to one, but the material does illustrate the relative potential of the GPU and the CPU, with a significant advantage for the former.

In the future I would like to run benchmarks on mobile devices. The project is at an early stage of development, so for now I do not have an Apple Developer account to conduct such tests.

Thanks for your attention! I will be glad to hear any comments on the article and on the project:
GitHub code.

Metal code for the lower level implementation

Low-level CPU implementation code

High-level code.

Source