Meduza, passports and shoddy code: why the passport numbers of all Internet voters ended up on the Internet
By the way, the Ministry of Communications still rules out ANY possibility of a leak of voters' passport data
Meanwhile, the distribution of passport series looks like this:
Let's replay the events and try to understand how all of this could have been avoided
On July 9, Meduza published the article "The authorities have in effect made the personal data of all Internet voters publicly available", which discussed the degvoter.zip archive.
How to find the degvoter.zip archive?
Here is how I found it. A careful search through Yandex led me to the page:
The text “Https checkvoter.gosuslugi.ru degvoter.zip” was found there. At that time the entry was dated 7.7.2020 (before Meduza's publication!). Since then this text has “moved” to the top of the page and the date has changed.
The archive itself was removed from the Gosuslugi (State Services) website, but a copy was saved on web.archive.org, from where everyone interested in studying it, myself included, downloaded it. To understand why this happened, I recommend going to the source: the robots.txt file on the Gosuslugi website.
What is inside degvoter.exe?
The degvoter program itself is written in C#: it is a WinForms application that works with an SQLite database. The files in the archive are dated 2020-06-30 22:17 (June 30, 2020). The application was evidently written in the shortest possible time: in Kamchatka it was already 7:17 on July 1 at that moment, and since the polling stations there opened at 8:00, the deadline was closer than ever (fortunately, only Moscow and Nizhny Novgorod voted electronically).
Passport Verification Code:
The application is complete garbage, both architecturally and cryptographically. Here's why:
Architectural flaws and the principle of the attack that recovers passport identifiers
The program shipped with a local database containing a passports table with two fields, num and used, where num was SHA256(<series> + <number>).
Very often, when a programmer without relevant experience tackles cryptography, they make the same handful of mistakes. One of them is using a hash function without any additional hardening, such as a salt or multiple rounds. The passport identifier consists of a 4-digit series and a 6-digit number [xxxx xxxxxx], i.e. there are 10^10 options. A phone number, by the way, also has 10 digits [+7 (xxx) xxx-xx-xx]. By the standards of the modern digital world these are not big numbers: one GB is at least 10^9 bytes, so at 10 bytes per identifier 100 GB is enough to write out every option. A space of this size is trivial to enumerate. I measured that, in single-threaded mode, a modern Intel Core i5 processor goes through all SHA-256 hashes for one passport series (000000-999999) in 5 seconds, and that is with the standard SHA-256 implementation without any additional tricks. In other words, a complete search of the whole space on an ordinary computer takes less than a day. If the search runs in several threads, an average processor copes with the task in a few hours.
This demonstrates that the system's designer does not understand how hash functions should be used. But even correct use of hash functions does not save passport data from disclosure under this architecture if the adversary has unlimited resources. A person who has gained access to the database can recover the passport identifiers in finite time, because verifying a single passport must itself take finite time. The only question is resources. (That said, if even a couple of million rounds had been applied here, then even such a questionable architectural decision as distributing the database along with the application would not have caused such an uproar, since it would have protected the data at least from journalists.) Meduza merely demonstrated the incompetence of the people who designed this part of the system.
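The enumeration described above can be sketched in a few lines of Python. This is only an illustration, not the code of degvoter or of DegvoterDecoder: the function names are hypothetical, and the exact input format of the hash (series and number concatenated as ASCII digits with no separator) is an assumption. With the 5 seconds per series measured above, 10^4 series take about 50,000 seconds, i.e. roughly 14 hours in one thread.

```python
import hashlib


def passport_hash(series, number):
    """SHA-256 of series + number as ASCII digits (input format is an assumption)."""
    return hashlib.sha256((series + number).encode("ascii")).hexdigest()


def crack_series(series, targets, count=1_000_000):
    """Try numbers 000000..count-1 for one series; return {hash: identifier}."""
    found = {}
    for n in range(count):
        number = f"{n:06d}"
        digest = passport_hash(series, number)
        if digest in targets:  # targets is a set, so the lookup is O(1)
            found[digest] = series + number
    return found
```

Running crack_series for every series from 0000 to 9999 against the set of num values from the passports table covers the whole space, and the loop over series is what can be split across threads or processes.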
Let's try to figure out how to do this much better while still fitting within a single night of development.
Quick-and-dirty architecture
Suppose we have no time at all and need to write the solution overnight.
The obvious requirement is that the database of passport hashes must live on the server, so this must be a client-server application. The question immediately arises: what if the Internet suddenly goes down at a polling station? For that case you need an Android version of the client application, which members of the precinct election commissions (PEC) should also be given to download. In places with neither Internet nor cellular coverage, people did not vote in this vote.
The hash in the database should not be computed directly from the passport identifier, so that the hashes cannot be reversed using existing precomputed lookup tables. First, you need a cryptographically strong hash function. The main question is HOW to use it. Many implementations are possible, but in essence they all come down to an algorithm with three parameters: the type of hash function, the number of iterations, and the value(s) mixed into the hash (common to all hashes). The final requirement is that the hash function is applied within every iteration and that the resulting computation speed is only a few hashes per second. Even if an attacker got hold of the server's database, recovering all the data would then take considerable time.
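One possible way to get these three parameters is PBKDF2-HMAC from the Python standard library. The article does not name a specific construction, so this is a swapped-in, standard technique used as a minimal sketch. The iteration count and the secret mixing value below are hypothetical placeholders: the count would have to be tuned by measurement on the real server until one hash takes a noticeable fraction of a second.

```python
import hashlib

# Hypothetical values for illustration only; tune ITERATIONS by measurement.
ITERATIONS = 2_000_000
SECRET_MIX = b"server-side-secret-value"  # common to all hashes, kept on the server


def slow_passport_hash(series, number, secret=SECRET_MIX,
                       iterations=ITERATIONS, hash_name="sha256"):
    """Iterated, keyed hash of a passport identifier via PBKDF2-HMAC."""
    identifier = (series + number).encode("ascii")
    # The hash function is applied in every iteration, mixed with the secret.
    return hashlib.pbkdf2_hmac(hash_name, identifier, secret, iterations).hex()
```

At a few hashes per second, enumerating 10^10 identifiers would take on the order of a hundred years of single-threaded computation instead of hours, and without the secret mixing value the enumeration cannot even be started.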
Each of the client applications will be just an input field plus an HTTP client that sends a request to the server.
The server works only over HTTPS, only during voting, and has a limit of 1 request per second per IP address. As the rate limiter we use Redis, where we write the IP address as a key with a TTL of one second. If the key exists, a request from that IP is rejected; if it does not, the request is allowed. This makes it possible to prevent brute force from the outside.
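The limiter logic can be sketched as follows. To keep the example self-contained, an in-memory dictionary stands in for Redis; the class name and its parameters are hypothetical. With a real Redis server the same check is a single command, SET <ip> 1 NX EX 1, which succeeds only if the key does not exist yet and makes it expire after one second.

```python
import time


class PerIpRateLimiter:
    """Allow at most one request per IP per TTL window (in-memory stand-in for Redis)."""

    def __init__(self, ttl_seconds=1.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so the logic can be tested without sleeping
        self._expires_at = {}    # ip -> moment at which its key expires

    def allow(self, ip):
        now = self._clock()
        expiry = self._expires_at.get(ip)
        if expiry is not None and expiry > now:
            return False         # key still alive: reject the request
        self._expires_at[ip] = now + self._ttl  # (re)create the key with a fresh TTL
        return True
```

Unlike Redis, this dictionary never evicts expired keys and is not shared between server processes, which is one reason the article's choice of Redis is the more practical one.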
Written this way, our solution, cobbled together literally from crap and sticks, will be an order of magnitude more secure than the current degvoter. Moreover, the difference in development time is small, and the work can be parallelized across 3 people (server, Windows client, Android client).
Let's look at possible leak scenarios.
There are the following points from which information about the system can be obtained:
1. Server side source code
2. Compiled server side files
3. Server database
4. Client Applications
The client applications in this case carry no information, even though the largest number of people have access to them and they are where a leak is most likely (which is exactly what happened).
To recover the information, you would need access to points (1,3) or (2,3), that is, the database together with the hashing method. With the database alone, without knowing the hashing method, it will be impossible to recover anything.
1. Every time you need to work with personal data in some way, involve an architect
2. Every time you need to work with personal data in some way, bring in a developer with experience or education in cryptography or information security
These two simple rules will help avoid the kind of shame we saw in the degvoter example. (Remember that a typical developer may not understand the nuances of using hash functions.)
DegvoterDecoder, a utility that demonstrates how the personal data can be recovered, is located in the repository dedicated to the analysis of voting data. By default it is configured for 8 threads. If you have already downloaded the degvoter.zip archive and you program in C#, you can easily figure out how it works.