Hello, Habr!

We present a translation of an interesting study from CrowdStrike. The article covers the use of Rust in data science (applied to malware analysis) and shows how Rust can compete with NumPy and SciPy in this field, not to mention pure Python.

Enjoy reading!

Python is one of the most popular programming languages for data science, and for good reason. The Python Package Index (PyPI) hosts a vast array of impressive data science libraries, including NumPy, SciPy, Natural Language Toolkit, Pandas and Matplotlib. With an abundance of high-quality analytic libraries and an extensive developer community, Python is the obvious choice for many data scientists.

Many of these libraries are implemented in C and C++ for performance reasons, but provide foreign function interfaces (FFIs) or Python bindings so that their functions can be called from Python. These lower-level implementations are meant to mitigate Python's most noticeable shortcomings, in particular execution time and memory consumption. Keeping execution time and memory consumption bounded greatly simplifies scaling, which is critical for reducing costs. If we can write high-performance code that solves data science problems, then integrating that code with Python is a serious advantage.

Working at the intersection of data science and malware analysis requires not only fast execution but also efficient use of shared resources, again for the sake of scaling. Scaling is one of the key problems in "big data" work, such as efficiently processing millions of executable files across many platforms. Achieving good performance on modern processors requires parallelism, usually implemented with multithreading, but it is also necessary to improve the efficiency of code execution and memory consumption. When solving such problems it can be difficult to balance the resources of the local system, and implementing multithreaded systems correctly is harder still. C and C++ are not inherently thread safe. Yes, there are external platform-specific libraries, but ensuring thread safety is ultimately the developer's responsibility.

Parsing malware is inherently dangerous. Malicious software often mishandles the data structures of its file format in unexpected ways, which crashes analysis tools. A relatively common pitfall in Python stems from its lack of strong type safety. Python generously accepts CDMY0CDMY values where CDMY1CDMY was expected, which can lead to general mayhem that is avoidable only by littering the code with checks for CDMY2CDMY. Such duck-typing assumptions often lead to crashes.

But there is Rust. Rust is widely positioned as the ideal solution to all of the potential problems outlined above: its execution time and memory consumption are comparable to C and C++, and it provides extensive type safety. Rust also offers additional amenities, in particular strong memory safety guarantees with no runtime overhead. Because there is no runtime overhead, integrating Rust code with code in other languages, Python in particular, is simpler. In this article we take a short tour of Rust to see whether it deserves the hype.

Sample application for Data Science

Data science is a very broad subject area with many applied aspects, and it is impossible to discuss them all in one article. One simple data science task is computing the information entropy of byte sequences. The general formula for entropy in bits is given on Wikipedia:

$$H(X) = -\sum_{i} P(x_i) \log_2 P(x_i)$$

To calculate the entropy of a random variable $X$, we first count how many times each possible byte value $x_i$ occurs, and then divide that count by the total number of elements to obtain the probability of encountering that value, $P(x_i)$. The entropy is then the negative of the sum, over all values $x_i$, of the probability $P(x_i)$ multiplied by $\log_2 P(x_i)$; the quantity $-\log_2 P(x_i)$ is the so-called self-information. Since we calculate the entropy in bits, we use $\log_2$ (note base 2 for bits).
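As a quick sanity check of the formula, here is a minimal sketch using only the Python standard library. The function name `shannon_entropy_bits` is hypothetical and is not part of the article's code; returning zero for empty input is a convention chosen for this sketch.

```python
import math
from collections import Counter


def shannon_entropy_bits(data: bytes) -> float:
    """Return the Shannon entropy of `data` in bits per byte."""
    if not data:
        return 0.0  # convention chosen for this sketch: empty input has zero entropy
    total = len(data)
    entropy = 0.0
    for count in Counter(data).values():
        p = count / total            # P(x_i): relative frequency of one byte value
        entropy -= p * math.log2(p)  # accumulate -P(x_i) * log2 P(x_i)
    return entropy


print(shannon_entropy_bits(b"aaaa"))            # a single repeated value: 0.0
print(shannon_entropy_bits(b"aabb"))            # two equally likely values: 1.0
print(shannon_entropy_bits(bytes(range(256))))  # all 256 values equally likely: 8.0
```

The three inputs cover the extremes: a constant sequence carries no information, while a sequence in which every byte value is equally likely reaches the maximum of 8 bits per byte.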

Let's try Rust and see how it handles entropy calculations compared to pure Python, as well as some of the most popular Python libraries mentioned above. This is a simplified assessment of Rust's potential performance in data science; this experiment is not a critique of Python or the excellent libraries it has. In these examples, we will generate our own C library from Rust code that we can import from Python. All tests were performed on Ubuntu 18.04.

Pure Python

Let's start with a simple function in pure Python (in entropy.py) for calculating the entropy of a bytearray, using only the math module from the standard library. This function is not optimized; treat it as a baseline for modifications and performance measurements.

import math


def compute_entropy_pure_python(data):
    """Compute entropy on bytearray `data`."""
    counts = [0] * 256
    entropy = 0.0
    length = len(data)

    for byte in data:
        counts[byte] += 1

    for count in counts:
        if count != 0:
            probability = float(count) / length
            entropy -= probability * math.log(probability, 2)

    return entropy

Python with NumPy and SciPy

Unsurprisingly, SciPy provides a function for calculating entropy. First, though, we use NumPy's np.bincount function to compute the byte frequencies. Comparing the performance of SciPy's entropy function with the other implementations is slightly unfair, since the SciPy implementation has additional functionality for computing relative entropy (Kullback-Leibler divergence). Again, we are simply taking a (hopefully not too slow) test drive to see how compiled Rust libraries imported from Python perform. We will stick with the SciPy implementation, which is included in our entropy.py script.

import numpy as np
from scipy.stats import entropy as scipy_entropy


def compute_entropy_scipy_numpy(data):
    """Compute entropy on bytearray `data` with SciPy and NumPy."""
    counts = np.bincount(bytearray(data), minlength=256)
    return scipy_entropy(counts, base=2)
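To make the "additional functionality" concrete, the following standard-library sketch shows what relative entropy (Kullback-Leibler divergence) computes alongside plain entropy. It is not SciPy's implementation; the helper names are hypothetical, and it assumes every `q` value is positive wherever the matching `p` value is positive.

```python
import math


def normalize(counts):
    """Turn raw counts into probabilities that sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]


def entropy_bits(p):
    """Shannon entropy of distribution `p` in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def relative_entropy_bits(p, q):
    """Kullback-Leibler divergence D(P || Q) in bits.

    Assumes q[i] > 0 wherever p[i] > 0 (an assumption of this sketch).
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


uniform = normalize([1, 1])  # two equally likely outcomes
certain = normalize([2, 0])  # one outcome always occurs
print(entropy_bits(uniform))                    # 1.0
print(relative_entropy_bits(uniform, uniform))  # 0.0, identical distributions
print(relative_entropy_bits(certain, uniform))  # 1.0
```

Plain entropy needs one distribution, while relative entropy compares two, which is the extra work the SciPy function is prepared to do.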

Python with Rust

Next, we examine our Rust implementation in somewhat more detail than the previous implementations, for the sake of thoroughness. We start with the default library package generated by Cargo. The following sections show how we modified the Rust package.

cargo new --lib rust_entropy

Cargo.toml

We start with the required manifest file, Cargo.toml, in which we define the Cargo package and specify the library name, rust_entropy_lib. We use the public cpython crate (v0.4.1), available on crates.io, the Rust package registry. In this article we use Rust v1.42.0, the latest stable version available at the time of writing.

[package]
name = "rust-entropy"
version = "0.1.0"
authors = ["Nobody <nobody@nowhere.com>"]
edition = "2018"

[lib]
name = "rust_entropy_lib"
crate-type = ["dylib"]

[dependencies.cpython]
version = "0.4.1"
features = ["extension-module"]

The Rust library implementation is very simple. As in our pure Python implementation, we initialize an array of counts for each possible byte value and iterate over the data to populate the counts. To finish, we compute and return the negative sum of the probabilities multiplied by the base-2 logarithm of the probabilities.

use cpython::{py_fn, py_module_initializer, PyResult, Python};

/// Compute the entropy of a byte slice.
fn compute_entropy_pure_rust(data: &[u8]) -> f64 {
    let mut counts = [0; 256];
    let mut entropy = 0_f64;
    let length = data.len() as f64;

    // collect byte counts
    for &byte in data.iter() {
        counts[usize::from(byte)] += 1;
    }

    // compute the entropy
    for &count in counts.iter() {
        if count != 0 {
            let probability = f64::from(count) / length;
            entropy -= probability * probability.log2();
        }
    }

    entropy
}

All that remains to include from CDMY10CDMY is a mechanism for calling the pure Rust function from Python. In CDMY11CDMY we include a CPython-specific function, compute_entropy_cpython, that calls our "pure" Rust function, compute_entropy_pure_rust. This design benefits us because we maintain a single pure Rust implementation while also providing a CPython-friendly wrapper.

/// Rust function adapted for CPython
fn compute_entropy_cpython(_: Python, data: &[u8]) -> PyResult<f64> {
    let _gil = Python::acquire_gil();
    let entropy = compute_entropy_pure_rust(data);
    Ok(entropy)
}

// initialize the Python module and add the Rust function for CPython
py_module_initializer!(
    librust_entropy_lib,
    initlibrust_entropy_lib,
    PyInit_rust_entropy_lib,
    |py, m| {
        m.add(py, "__doc__", "Entropy module implemented in Rust")?;
        m.add(
            py,
            "compute_entropy_cpython",
            py_fn!(py, compute_entropy_cpython(data: &[u8])),
        )?;
        Ok(())
    }
);

Call Rust code from Python

Finally, we call the Rust implementation from Python (again, from entropy.py). To do this, we first import our custom dynamic library compiled from Rust. Then we simply call the exported library function that we registered when initializing the Python module with the py_module_initializer! macro in our Rust code. At this point we have a single Python module (entropy.py) that includes functions for calling all of the entropy implementations.

import rust_entropy_lib


def compute_entropy_rust_from_python(data):
    """Compute entropy on bytearray `data` with Rust."""
    return rust_entropy_lib.compute_entropy_cpython(data)

We compile the above Rust library package on Ubuntu 18.04 using Cargo. (This link may come in handy for OS X users.)

cargo build --release 

Once the build finishes, we rename the resulting library and copy it to the directory containing our Python modules so that it can be imported from scripts. The library built by Cargo is named librust_entropy_lib.so, but it must be renamed to rust_entropy_lib.so to be imported successfully in these tests.
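The rename-and-copy step can also be scripted. The sketch below assumes a Linux build, where Cargo writes release artifacts to target/release and prefixes shared libraries with `lib`; the helper names and the example path are assumptions for illustration, not taken from the article.

```python
import shutil
from pathlib import Path


def importable_name(built_library: Path) -> str:
    """Drop the `lib` prefix so the file name matches the Python module name."""
    name = built_library.name
    return name[3:] if name.startswith("lib") else name


def install_library(built_library: Path, destination_dir: Path) -> Path:
    """Copy the built library next to the Python modules under its importable name."""
    target = destination_dir / importable_name(built_library)
    shutil.copy2(built_library, target)
    return target


# Example call (assumed path, run from the Cargo package root after the build):
# install_library(Path("target/release/librust_entropy_lib.so"), Path("."))
print(importable_name(Path("target/release/librust_entropy_lib.so")))
```

Keeping the name mapping in its own function makes it easy to adjust for platforms that use a different prefix or file extension.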

Performance Test: Results

We measured the performance of each implementation using pytest benchmarks, computing the entropy of 1 million random bytes. All implementations were tested on the same data. The benchmarks (also included in entropy.py) are shown below.

# ### BENCHMARKS ###
# generate some random bytes to test w/ NumPy
NUM = 1000000
VAL = np.random.randint(0, 256, size=(NUM, ), dtype=np.uint8)


def test_pure_python(benchmark):
    """Test pure Python."""
    benchmark(compute_entropy_pure_python, VAL)


def test_python_scipy_numpy(benchmark):
    """Test pure Python with SciPy."""
    benchmark(compute_entropy_scipy_numpy, VAL)


def test_rust(benchmark):
    """Test Rust implementation called from Python."""
    benchmark(compute_entropy_rust_from_python, VAL)
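For a rough timing without the pytest benchmark fixture, the standard library's `timeit` module can serve as a stand-in. This is a sketch and not the measurement method used in the article; `count_bytes` is a hypothetical placeholder workload so the block stays self-contained, and any of the entropy functions could be passed in its place.

```python
import os
import timeit
from collections import Counter


def count_bytes(data):
    """Placeholder workload: build a histogram of byte values."""
    return Counter(data)


def best_time_seconds(func, data, repeat=3, number=5):
    """Return the best average wall-clock time of `func(data)` in seconds."""
    timings = timeit.repeat(lambda: func(data), repeat=repeat, number=number)
    return min(timings) / number


# 100,000 random bytes keep this sketch quick; the benchmarks above use 1,000,000.
sample = os.urandom(100_000)
print(f"best time: {best_time_seconds(count_bytes, sample):.6f} s per call")
```

Taking the minimum over several repeats reduces the influence of unrelated system activity on the reported figure.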

Finally, we write a separate simple driver script for each entropy method. Below is a representative driver script for testing the pure Python implementation. The testdata.bin file contains 1,000,000 random bytes used to test all methods. Each method repeats the calculation 100 times to make it easier to capture memory usage data.

import entropy

with open('testdata.bin', 'rb') as f:
    DATA = f.read()

for _ in range(100):
    entropy.compute_entropy_pure_python(DATA)
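The article does not say how testdata.bin was produced. One plausible way, shown here purely as an assumption, is to write bytes from the operating system's random source; the function name `write_test_data` is hypothetical.

```python
import os
from pathlib import Path


def write_test_data(path, num_bytes=1_000_000):
    """Write `num_bytes` random bytes to `path` and return the number written."""
    data = os.urandom(num_bytes)  # assumption: any uniform random source would do
    Path(path).write_bytes(data)
    return len(data)
```

Calling `write_test_data("testdata.bin")` once before running the driver scripts would give every method the same input file.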

Both the SciPy/NumPy and Rust implementations performed well, easily beating the unoptimized pure Python implementation by more than 100 times. The Rust version was only slightly faster than the SciPy/NumPy version, but the results confirmed our expectations: pure Python is much slower than compiled languages, and extensions written in Rust can compete quite successfully with their C counterparts (even beating them in this microbenchmark).

There are also other ways to improve performance. We could use the CDMY20CDMY or CDMY21CDMY modules. We could also add type hints and use Cython to generate a library that can be imported from Python. Each of these options comes with its own trade-offs that must be considered.

We also measured the memory consumption of each implementation using the GNU time application (not to be confused with the shell built-in time command). In particular, we measured the maximum resident set size.
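The same metric can be read from inside a Python process with the standard library's `resource` module, which is available on Unix-like systems only. This is an alternative sketch, not the measurement method used in the article, and the function name `peak_rss_kib` is hypothetical.

```python
import resource
import sys


def peak_rss_kib():
    """Return this process's peak resident set size in KiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # Linux reports ru_maxrss in kilobytes, macOS reports it in bytes.
    return peak // 1024 if sys.platform == "darwin" else peak


print(f"peak RSS: {peak_rss_kib()} KiB")
```

Printing this value at the end of a driver script gives a figure comparable to the one reported by an external tool, with the interpreter's own footprint included in both cases.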

While the maximum resident set sizes of the pure Python and Rust implementations are very similar, the SciPy/NumPy implementation uses significantly more memory in this benchmark. Presumably this is due to additional functionality loaded into memory on import. Either way, calling Rust code from Python does not appear to add any significant memory overhead.

We are extremely impressed with the performance achieved by calling Rust from Python. In our admittedly brief evaluation, the Rust implementation was able to compete in performance with the underlying C implementations in the SciPy and NumPy packages. Rust appears to be well suited for efficient large-scale processing.

Rust showed not only excellent execution times; its memory overhead in these tests was also minimal. Such execution time and memory usage characteristics are ideal for scaling. The performance of the SciPy and NumPy C FFI implementations is certainly comparable, but Rust gives us additional advantages that C and C++ do not. Its memory safety and thread safety guarantees are a very attractive benefit.

While C offers execution times comparable to Rust, C itself does not provide thread safety. There are external libraries that provide such functionality for C, but the developer is fully responsible for using them correctly. Rust catches thread safety problems such as data races at compile time, thanks to its ownership model, and its standard library provides a set of concurrency mechanisms, including channels, locks and reference-counted smart pointers.

We do not advocate porting SciPy or NumPy to Rust, since these Python libraries are already well optimized and supported by excellent developer communities. On the other hand, we strongly recommend porting to Rust any pure Python code that is not covered by high-performance libraries. In the context of data science applications used for security analysis, Rust appears to be a competitive alternative to Python, given its speed and safety guarantees.