I recently wrote my first 4K intro in Rust and presented it at Nova 2020, where it won first place in the New School Intro Competition. Writing a 4K intro is quite difficult and requires knowledge of many different areas. Here I will focus on techniques for making Rust code as small as possible.


You can watch the demo on YouTube, download the executable from Pouet, or get the source code from GitHub.

A 4K intro is a demo in which the entire program (including any data) takes up 4096 bytes or less, so it is important that the code is as compact as possible. Rust has a reputation for producing bloated executables, so I wanted to find out whether I could write small, efficient code in it.

Configuration


The entire intro is written in a combination of Rust and glsl. Glsl is used for rendering, and Rust does everything else: creating the world, controlling the camera and objects, creating instruments, playing music, and so on.

The code depends on some features that are not yet available in stable Rust, so I use the nightly Rust toolchain. To install it and make it the default, run the following rustup commands:

rustup toolchain install nightly
rustup default nightly

I use crinkler to compress the object file generated by the Rust compiler.

I also used shader minifier to preprocess the glsl shader, making it smaller and more crinkler-friendly. Shader minifier cannot output a .rs file, so I took the raw output and copied it into my shader.rs file by hand (in hindsight, I should have automated this step somehow, or even written a pull request for shader minifier).
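One way such automation could look is a small helper, run for example from a build script, that wraps the minifier's raw output in a Rust string constant. This is only a sketch: the function name shader_to_rust_source and the constant name FRAG_SHADER are made up for illustration and are not part of the intro's source or of shader minifier.

```rust
/// Hypothetical helper: wraps minified glsl text in a Rust `static`
/// declaration that could be written to a file such as shader.rs.
fn shader_to_rust_source(const_name: &str, glsl: &str) -> String {
    // `{:?}` on a &str produces a quoted, escaped literal that is valid Rust.
    format!("pub static {}: &str = {:?};\n", const_name, glsl.trim())
}

// Example use (the shader text here is a placeholder):
//     let src = shader_to_rust_source("FRAG_SHADER", "void main(){}");
//     std::fs::write("src/shader.rs", src).unwrap();
```

The helper only does the text wrapping; reading the minifier output and choosing where to write the result are left to the caller.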

My previous 4K intro in Rust served as the starting point; at the time it seemed pretty concise to me. The article about it also has more details on setting up the toml file and on using xargo to compile a tiny binary.

Optimize program design to reduce code


Many of the most effective size optimizations are not clever hacks at all. They come from rethinking the design.

In my original project, one part of the code created the world, including placing the spheres, and another part was responsible for moving the spheres. At some point I realized that the placement code and the movement code do very similar things, and that they could be combined into one considerably more complex function that does both. Unfortunately, optimizations like this make the code less elegant and less readable.
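As a rough illustration of this kind of merge (not the intro's actual code), the sketch below uses one function for both jobs: calling it with time zero gives the initial placement, and calling it with a later time gives the animated positions. The Sphere layout, the name place_or_move and the motion formula are all assumptions made for the example.

```rust
/// Hypothetical sphere layout: x, y, z, radius.
type Sphere = [f32; 4];

/// One routine for both placement and movement.
/// `time == 0.0` yields the starting layout; larger values move the spheres.
fn place_or_move(spheres: &mut [Sphere], time: f32) {
    for (i, s) in spheres.iter_mut().enumerate() {
        let n = i as f32;
        s[0] = n * 2.0;               // spheres are spaced along x
        s[1] = time * 0.5 * (n + 1.0); // each sphere rises at its own rate
        s[2] = 0.0;
        s[3] = 1.0;                   // fixed radius
    }
}
```

Because the position is a pure function of the index and the time, no separate initialization code is needed, which is where the size saving would come from.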

Assembler code analysis


At some point you have to look at the compiled assembly to see what the code compiles into and which size optimizations are worth it. The Rust compiler has a very useful --emit=asm option for outputting assembly code. The following command creates a .s assembly file:

xargo rustc --release --target i686-pc-windows-msvc -- --emit=asm 

You do not have to be an assembly expert to benefit from studying the output, but a basic understanding of the syntax definitely helps. The opt-level = "z" option makes the compiler optimize for the smallest possible size, which makes it a little harder to figure out which part of the assembly corresponds to which part of the Rust code.

I found that the Rust compiler can be surprisingly good at minimizing code, removing unused code and unnecessary parameters. It also does some strange things, so it is very important to inspect the assembly output from time to time.

Additional Features


I worked with two versions of the code. One logs what is happening and lets the viewer manipulate the camera to create interesting trajectories. Rust lets you define features for this kind of optional functionality. The Cargo.toml file has a [features] section where you declare the available features and their dependencies. The Cargo.toml of my 4K intro has the following section:

[features]
logger = []
fullscreen = []

None of these features has dependencies, so they effectively work as conditional compilation flags. Conditional blocks of code are preceded by a #[cfg(feature = "...")] attribute. Using features does not by itself make the code smaller, but it greatly simplifies development because you can easily switch between different feature sets.

#[cfg(feature = "fullscreen")]
{
    // This code is compiled only if fullscreen is selected
}
#[cfg(not(feature = "fullscreen"))]
{
    // This code is compiled only if fullscreen is not selected
}

Having studied the compiled code, I am confident that only the selected features are included in it.

One of the main uses of features was to enable logging and error checking in debug builds. The code that loads and compiles the glsl shader often failed, and without useful error messages it would have been extremely difficult to track down the problems.
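A minimal sketch of how a logger feature might gate error reporting is shown below. The function report_shader_status and its parameters are hypothetical and only stand in for whatever check follows shader compilation; the logging branch is compiled only when the logger feature is enabled.

```rust
/// Hypothetical check called after compiling a shader.
/// `ok` is the compile status and `info_log` the driver's message.
fn report_shader_status(ok: bool, info_log: &str) -> bool {
    #[cfg(feature = "logger")]
    {
        // Present only in builds with `--features logger`.
        if !ok {
            eprintln!("shader compile failed: {}", info_log);
        }
    }
    #[cfg(not(feature = "logger"))]
    {
        // Release build: no message and no formatting code is emitted.
        let _ = info_log;
    }
    ok
}
```

With this shape, a build such as cargo build --features logger gets the diagnostics, while the size-critical build carries none of the logging code.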

Using get_unchecked


When I put code inside an unsafe{} block, I more or less assumed that all safety checks would be disabled, but that is not the case. All the usual checks are still performed there, and they are expensive.

By default, Rust range-checks every array access. Take the following Rust code:

delay_counter=sequence[ play_pos ]; 

Before doing the table lookup, the compiler inserts code that checks that play_pos does not index past the end of sequence, and panics if it does. This adds considerable size to the code, because there can be many such lookups.

Convert the code as follows:

delay_counter=*sequence.get_unchecked( play_pos ); 

This tells the compiler not to perform any range check and just do the table lookup. It is clearly a dangerous operation, so it can only be performed inside unsafe code.
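The two forms can be put side by side in a self-contained sketch. The table contents and the function names are invented for illustration; only the indexing expressions mirror the snippets above.

```rust
/// Hypothetical lookup table standing in for the music sequence.
static SEQUENCE: [u8; 4] = [3, 1, 4, 1];

/// Checked access: the compiler emits a bounds test and a panic path.
fn delay_checked(play_pos: usize) -> u8 {
    SEQUENCE[play_pos]
}

/// Unchecked access: no bounds test is emitted.
/// The caller must guarantee `play_pos < SEQUENCE.len()`,
/// otherwise the behavior is undefined.
fn delay_unchecked(play_pos: usize) -> u8 {
    unsafe { *SEQUENCE.get_unchecked(play_pos) }
}
```

Both functions return the same value for valid indices; the difference shows up only in the generated code and in what happens for an out-of-range index.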

More efficient loops


Initially, all of my loops were written idiomatically, the way Rust expects, using the for x in 0..10 syntax. I assumed this would compile to the tightest possible loop. Surprisingly, it does not. The simplest case:

for x in 0..10 {
    // do code
}

compiles into assembly that does the following:

setup loop variable
loop:
    check the loop condition
    if the loop is finished, jump to end
    // execute the code inside the loop
    unconditionally jump to loop
end:

whereas the following code

let mut x = 0;
loop {
    // do code
    x += 1;
    if x == 10 {
        break;
    }
}

compiles directly to:

setup loop variable
loop:
    // execute the code inside the loop
    check the loop condition
    if the loop is not finished, jump to loop
end:

Note that the condition is checked at the end of each iteration, which makes the unconditional jump unnecessary. This is a small saving for a single loop, but it really adds up when the program has 30 loops.

Another, much harder to understand problem with the idiomatic Rust loop is that in some cases the compiler added extra iterator setup code that really bloated the output. I still do not understand what triggers this extra iterator setup, but it was always trivial to replace the for constructs with loop{} constructs.
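To compare the two loop forms in the assembly output, a pair like the following can be compiled and inspected. The function names and the summing body are hypothetical. Whether the second form actually comes out smaller depends on the compiler version and settings, so the emitted assembly is the thing to check.

```rust
/// Idiomatic form: iterates over a range.
fn sum_for(data: &[u32; 10]) -> u32 {
    let mut total = 0;
    for x in 0..10 {
        total += data[x];
    }
    total
}

/// Hand-written form: the exit test sits at the bottom of the loop.
/// Only valid because the trip count (10) is known to be at least one.
fn sum_loop(data: &[u32; 10]) -> u32 {
    let mut total = 0;
    let mut x = 0;
    loop {
        total += data[x];
        x += 1;
        if x == 10 {
            break;
        }
    }
    total
}
```

The bottom-tested form runs the body once before checking anything, so it is only a drop-in replacement when the loop is known to execute at least once.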

Using vector instructions


I spent a lot of time optimizing the glsl code, and one of the best classes of optimization (which usually also makes the code faster) is to operate on an entire vector at a time rather than on each component in turn.

For example, the ray tracing code uses a fast grid traversal algorithm to check which parts of the map each ray visits. The original algorithm considers each axis separately, but it can be rewritten to consider all axes at once without any branches. Rust has no native vector type like glsl does, but you can use intrinsics to tell the compiler to use SIMD instructions.

To use intrinsics, I would convert the following code

global_spheres[ CAMERA_ROT_IDX ][ 0 ] += camera_rot_speed[ 0 ]*camera_speed;
global_spheres[ CAMERA_ROT_IDX ][ 1 ] += camera_rot_speed[ 1 ]*camera_speed;
global_spheres[ CAMERA_ROT_IDX ][ 2 ] += camera_rot_speed[ 2 ]*camera_speed;

into this:

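// Assumes both arrays hold four f32 values and are 16-byte aligned, as _mm_load_ps and _mm_store_ps require.
let mut dst:x86::__m128=core::arch::x86::_mm_load_ps(global_spheres[ CAMERA_ROT_IDX ].as_mut_ptr());
let mut src:x86::__m128=core::arch::x86::_mm_load_ps(camera_rot_speed.as_mut_ptr());
src=core::arch::x86::_mm_mul_ps( src, core::arch::x86::_mm_set1_ps( camera_speed ) ); // scale by camera_speed, as in the scalar version
dst=core::arch::x86::_mm_add_ps( dst, src);
// _mm_store_ps writes all four lanes; _mm_store_ss would write only the first component.
core::arch::x86::_mm_store_ps( (&mut global_spheres[ CAMERA_ROT_IDX ]).as_mut_ptr(), dst );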

which is slightly smaller (and much less readable). Unfortunately, for some reason this broke the debug build, although it worked fine in the release build. Clearly the problem lies in my knowledge of Rust intrinsics, not in the language itself. This is something worth spending more time on when preparing the next 4K intro, as the reduction in code size was significant.
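For experimenting outside the intro, here is a self-contained sketch of the same add-a-scaled-vector idea. It assumes four-component arrays and uses the unaligned load and store intrinsics, so it does not depend on how the arrays are aligned. The function name add_scaled is hypothetical, and a scalar fallback is included so the example also builds on non-x86 targets.

```rust
#[cfg(target_arch = "x86")]
use core::arch::x86::*;
#[cfg(target_arch = "x86_64")]
use core::arch::x86_64::*;

/// dst[i] += src[i] * scale for all four lanes at once.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn add_scaled(dst: &mut [f32; 4], src: &[f32; 4], scale: f32) {
    // SAFETY: both pointers refer to four valid f32 values, and the
    // unaligned variants place no alignment requirement on them.
    // SSE is assumed to be available, as it is on x86_64 by default.
    unsafe {
        let d = _mm_loadu_ps(dst.as_ptr());
        let s = _mm_loadu_ps(src.as_ptr());
        let k = _mm_set1_ps(scale); // broadcast the scalar to all lanes
        let r = _mm_add_ps(d, _mm_mul_ps(s, k));
        _mm_storeu_ps(dst.as_mut_ptr(), r);
    }
}

/// Scalar fallback for other architectures.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
fn add_scaled(dst: &mut [f32; 4], src: &[f32; 4], scale: f32) {
    for i in 0..4 {
        dst[i] += src[i] * scale;
    }
}
```

Unlike the three-line scalar version, this updates the fourth component as well, so it is only equivalent when the fourth lane of the source is zero or its result is unused.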

Using OpenGL


There are many standard Rust crates for loading OpenGL functions, but by default they all load a very large set of functions. Each loaded function takes up some space, because the loader must know its name. Crinkler compresses this kind of code very well, but it cannot get rid of the overhead completely, so I had to create my own version of gl.rs, which includes only the OpenGL functions I need.
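The idea of a trimmed-down loader can be sketched as follows. This is not the intro's actual loader: the three function names are just examples of real OpenGL entry points, and the get_proc parameter is a stand-in for the platform lookup (on Windows that would typically be wglGetProcAddress), passed in as a closure so the sketch stays platform independent.

```rust
/// Hypothetical subset: only the entry points the program actually calls.
/// Names are NUL-terminated so they can be handed to a C API unchanged.
const GL_NAMES: [&str; 3] = [
    "glCreateShader\0",
    "glShaderSource\0",
    "glCompileShader\0",
];

/// Resolves every name in GL_NAMES through `get_proc` and returns the
/// addresses in the same order. A zero entry means the lookup failed.
fn load_gl(get_proc: impl Fn(&str) -> usize) -> [usize; 3] {
    let mut table = [0usize; 3];
    let mut i = 0;
    loop {
        table[i] = get_proc(GL_NAMES[i]);
        i += 1;
        if i == GL_NAMES.len() {
            break;
        }
    }
    table
}
```

Every name added to the list costs its own bytes in the binary, so the list should contain exactly what the intro calls and nothing else.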

Conclusion


The main goal was to write a competitive, correct 4K intro and prove that Rust is suitable for the demo scene and for scenarios where every byte matters and you really need low-level control. As a rule, only assembler and C were considered in this area. An additional goal was to maximize the use of idiomatic Rust.

I think I succeeded quite well at the first goal. I never felt that Rust was holding me back, or that I was sacrificing performance or features because I was using Rust rather than C.

I was less successful with the second goal. There is too much unsafe code that really should not be there. unsafe has a corrosive effect: it is very easy to use it to get something done quickly (for example, with mutable static variables), but as soon as unsafe code appears it breeds more unsafe code, and suddenly it is everywhere. In the future I will be much more careful to use unsafe only when there really is no alternative.
