Foreword


I looked many times at the Internet, found many interesting articles about hash tables, but I did not find an intelligible and complete description of how they are implemented. In this regard, I just could not wait to write a post on this, so interesting, topic.


It may not be so useful for experienced programmers, but it will be interesting for students of technical universities and novice self-taught programmers.


image


Motivation to use hash tables


For clarity, consider standard containers and the asymptotic behavior of their most commonly used methods.


Container \ operation insert remove find
Array O (N) O (N) O (N)
List O (1) O (1) O (N)
Sorted array O (N) O (N) O (logN)
Binary Search Tree O (logN) O (logN) O (logN)
Hash table O (1) O (1) O (1)

All data provided well-executed containers, well-chosen hash functions
From this table, it is very clear why it is worth using hash tables. But then the opposite question arises: why then they are not used constantly?
The answer is very simple: as always, it is impossible to get everything at once, namely: both speed and memory. Hash tables are heavy, and although they quickly answer questions of basic operations, using them all the time is very costly.


The concept of a hash table


A hash table is a container. His implementation may not be obvious, but rather simple.


To begin with an explanation in a few words. We define a hash function that will determine a natural number for each incoming element. And further down this natural number, we will put the element in (let's say) an array. Then having such a function, we can process the element in O (1).


Now it became clear why this is the hash table.


Collision Problem


Naturally, the question arises, why is it impossible that we get into the same cell twice in the array, because it is simply impossible to imagine a function that compares completely different natural numbers to each element. This is how the problem of collision arises, or problems when the hash function produces the same natural number for different elements.


There are several solutions to this problem: the chaining method and the double hashing method. In this article I will try to talk about the second method, as a more beautiful and, possibly, more complex.


Solving the double-hash collision problem


We will (as the name implies) use two hash functions that return mutually simple integers.


One hash function (upon entering g) will return a positive integer s, which will be initial for us. That is, the first thing we will do is try to put the g element at position s in our array. But what if this place is already taken? This is where the second hash function comes in handy, which will return t - the step with which we will continue to look for a place where to put the element g.


We will consider first the element s, then s + t, then s + 2 * t, etc. Naturally, in order not to go beyond the boundaries of the array, we must look at the element number modulo (the remainder of the division by the size of the array).


Finally, we explained all the most important points, we can proceed to direct code writing, where it will be possible to consider all the remaining nuances.Well, a rigorous mathematical proof of the correctness of using double hashing can be found here .




Implementation of the hash table


For clarity, we will implement a hash table that stores strings.


Let's start by defining the hash functions themselves, we implement them using the Horner method. An important parameter of the correctness of the hash function is that the return value must be mutually simple with the size of the table. To reduce code duplication, we will use two structures that refer to the implementation of the hash function itself.


int HashFunctionHorner(const std::string& s, int table_size, const int key) { int hash_result=0; for (int i=0; s[i] != 0; ++i) hash_result=(key * hash_result + s[i]) % table_size; hash_result=(hash_result * 2 + 1) % table_size; return hash_result; } struct HashFunction1 { int operator()(const std::string& s, int table_size) const { return HashFunctionHorner(s, table_size, table_size - 1);//ключи должны быть взаимопросты, используем числа <размер таблицы> плюс и минус один. } }; struct HashFunction2 { int operator()(const std::string& s, int table_size) const { return HashFunctionHorner(s, table_size, table_size + 1); } }; 

To move on, we need to deal with the problem: what will happen if we remove the item from the table? So, you need to mark it with the deleted flag, but you can’t just delete it permanently. After all, if we do this, then when we try to find an element (the value of the hash function of which coincides with its value at our remote element), we will immediately come across an empty cell. And this means that such an element never happened, although it lies just somewhere further in the array. This is the main difficulty in using this method of resolving collisions.


Remembering this problem, we will build our class.


template <class T, class THash1=HashFunction1, class THash2=HashFunction2> class HashTable { static const int default_size=8;//начальный размер нашей таблицы constexpr static const double rehash_size=0.75;//коэффициент, при котором произойдет увеличение таблицы struct Node { T value; bool state;//если значение флага state=false, значит элемент массива был удален (deleted) Node(const T& value_) : value(value_), state(true) {} }; Node** arr;//соответственно в массиве будут хранится структуры Node* int size;//сколько элементов у нас сейчас в массиве (без учета deleted) int buffer_size;//размер самого массива, сколько памяти выделено под хранение нашей таблицы int size_all_non_nullptr;//сколько элементов у нас сейчас в массиве (с учетом deleted) }; 

At this stage, we have more or less understood what will be stored in the table. Let's move on to the implementation of utility methods.


... public: HashTable() { buffer_size=default_size; size=0; size_all_non_nullptr=0; arr=new Node*[buffer_size]; for (int i=0; i < buffer_size; ++i) arr[i]=nullptr;//заполняем nullptr - то есть если значение отсутствует, и никто раньше по этому адресу не обращался } ~HashTable() { for (int i=0; i < buffer_size; ++i) if (arr[i]) delete arr[i]; delete[] arr; } 

Of the necessary methods, it remains to implement dynamic increase, array expansion - the Resize method.


We’re doubling the size standardly.


void Resize() { int past_buffer_size=buffer_size; buffer_size *= 2; size_all_non_nullptr=0; size=0; Node** arr2=new Node * [buffer_size]; for (int i=0; i < buffer_size; ++i) arr2[i]=nullptr; std::swap(arr, arr2); for (int i=0; i < past_buffer_size; ++i) { if (arr2[i] && arr2[i]->state) Add(arr2[i]->value);//добавляем элементы в новый массив }//удаление предыдущего массива for (int i=0; i < past_buffer_size; ++i) if (arr2[i]) delete arr2[i]; delete[] arr2; } 

It is important to maintain the asymptotic behavior of O (1) standard operations. But what can affect the speed of work? Our deleted items. After all, as we remember, we can’t do anything with them, but we cannot completely reset them. So they reach for us with huge ballast. To speed up the work of our hash table, we will use a rehash (as we recall, we have already allocated very strange variables for this).


Now we’ll use them if the percentage of real array elements is less than 50, we produce Rehash , namely, we do the same thing as when we enlarge the table (resize), but do not increase it. This may sound silly, but I'll try to explain now. We will call our hash functions from all elements, move them to a new array. But this will not happen with deleted elements, we will not move them, and they will be deleted along with the old table.


But why words, the code will clarify everything:


void Rehash() { size_all_non_nullptr=0; size=0; Node** arr2=new Node * [buffer_size]; for (int i=0; i < buffer_size; ++i) arr2[i]=nullptr; std::swap(arr, arr2); for (int i=0; i < buffer_size; ++i) { if (arr2[i] && arr2[i]->state) Add(arr2[i]->value); }//удаление предыдущего массива for (int i=0; i < buffer_size; ++i) if (arr2[i]) delete arr2[i]; delete[] arr2; } 

Well now we are definitely on the final, albeit long, and full of thorny bushes, straight. We need to implement the insertion (Add), removal (Remove) and search (Find) element.


Let's start with the simplest - the Find method element by value.


bool Find(const T& value, const THash1& hash1=THash1(), const THash2& hash2=THash2()) { int h1=hash1(value, buffer_size);//значение, отвечающее за начальную позицию int h2=hash2(value, buffer_size);//значение, ответственное за "шаг" по таблице int i=0; while (arr[h1] != nullptr && i < buffer_size) { if (arr[h1]->value == value && arr[h1]->state) return true;//такой элемент есть h1=(h1 + h2) % buffer_size; ++i;//если у нас i >= buffer_size, значит мы уже обошли абсолютно все ячейки, именно для этого мы считаем i, иначе мы могли бы зациклиться. } return false; } 

Next, we will implement the removal of the element - Remove . How do we do this? We find the element (as in the Find method) and then delete it, that is, just change the value of state to false , but we will not delete Node itself.


bool Remove(const T& value, const THash1& hash1=THash1(), const THash2& hash2=THash2()) { int h1=hash1(value, buffer_size); int h2=hash2(value, buffer_size); int i=0; while (arr[h1] != nullptr && i < buffer_size) { if (arr[h1]->value == value && arr[h1]->state) { arr[h1]->state=false; --size; return true; } h1=(h1 + h2) % buffer_size; ++i; } return false; } 

Well, the last one we implement is the Add method. It has some very important nuances. This is where we will check for the need for Rehesh.


In addition to this, there is one more part in this method that supports the correct asymptotic behavior. This is remembering the first item to be inserted (even if it is deleted). This is where we will insert the element if our hash table does not have the same. If there are no deleted items in our path, we create a new Node with our inserted value.


bool Add(const T& value, const THash1& hash1=THash1(),const THash2& hash2=THash2()) { if (size + 1 > int(rehash_size * buffer_size)) Resize(); else if (size_all_non_nullptr > 2 * size) Rehash();//происходит рехеш, так как слишком много deleted-элементов int h1=hash1(value, buffer_size); int h2=hash2(value, buffer_size); int i=0; int first_deleted=-1;//запоминаем первый подходящий (удаленный) элемент while (arr[h1] != nullptr && i < buffer_size) { if (arr[h1]->value == value && arr[h1]->state) return false;//такой элемент уже есть, а значит его нельзя вставлять повторно if (!arr[h1]->state && first_deleted == -1)//находим место для нового элемента first_deleted=h1; h1=(h1 + h2) % buffer_size; ++i; } if (first_deleted == -1)//если не нашлось подходящего места, создаем новый Node { arr[h1]=new Node(value); ++size_all_non_nullptr;//так как мы заполнили один пробел, не забываем записать, что это место теперь занято } else { arr[first_deleted]->value=value; arr[first_deleted]->state=true; } ++size;//и в любом случае мы увеличили количество элементов return true; } 

In conclusion, I’ll give a complete implementation of the hash table.
int HashFunctionHorner(const std::string& s, int table_size, const int key) { int hash_result=0; for (int i=0; s[i] != 0; ++i) { hash_result=(key * hash_result + s[i]) % table_size; } hash_result=(hash_result * 2 + 1) % table_size; return hash_result; } struct HashFunction1 { int operator()(const std::string& s, int table_size) const { return HashFunctionHorner(s, table_size, table_size - 1); } }; struct HashFunction2 { int operator()(const std::string& s, int table_size) const { return HashFunctionHorner(s, table_size, table_size + 1); } }; template <class T, class THash1=HashFunction1, class THash2=HashFunction2> class HashTable { static const int default_size=8; constexpr static const double rehash_size=0.75; struct Node { T value; bool state; Node(const T& value_) : value(value_), state(true) {} }; Node** arr; int size; int buffer_size; int size_all_non_nullptr; void Resize() { int past_buffer_size=buffer_size; buffer_size *= 2; size_all_non_nullptr=0; size=0; Node** arr2=new Node * [buffer_size]; for (int i=0; i < buffer_size; ++i) arr2[i]=nullptr; std::swap(arr, arr2); for (int i=0; i < past_buffer_size; ++i) { if (arr2[i] && arr2[i]->state) Add(arr2[i]->value); } for (int i=0; i < past_buffer_size; ++i) if (arr2[i]) delete arr2[i]; delete[] arr2; } void Rehash() { size_all_non_nullptr=0; size=0; Node** arr2=new Node * [buffer_size]; for (int i=0; i < buffer_size; ++i) arr2[i]=nullptr; std::swap(arr, arr2); for (int i=0; i < buffer_size; ++i) { if (arr2[i] && arr2[i]->state) Add(arr2[i]->value); } for (int i=0; i < buffer_size; ++i) if (arr2[i]) delete arr2[i]; delete[] arr2; } public: HashTable() { buffer_size=default_size; size=0; size_all_non_nullptr=0; arr=new Node*[buffer_size]; for (int i=0; i < buffer_size; ++i) arr[i]=nullptr; } ~HashTable() { for (int i=0; i < buffer_size; ++i) if (arr[i]) delete arr[i]; delete[] arr; } bool Add(const T& value, const THash1& hash1=THash1(),const THash2& hash2=THash2()) { if (size + 1 > int(rehash_size * buffer_size)) Resize(); else if (size_all_non_nullptr > 2 * size) Rehash(); int h1=hash1(value, buffer_size); int h2=hash2(value, buffer_size); int i=0; int first_deleted=-1; while (arr[h1] != nullptr && i < buffer_size) { if (arr[h1]->value == value && arr[h1]->state) return false; if (!arr[h1]->state && first_deleted == -1) first_deleted=h1; h1=(h1 + h2) % buffer_size; ++i; } if (first_deleted == -1) { arr[h1]=new Node(value); ++size_all_non_nullptr; } else { arr[first_deleted]->value=value; arr[first_deleted]->state=true; } ++size; return true; } bool Remove(const T& value, const THash1& hash1=THash1(), const THash2& hash2=THash2()) { int h1=hash1(value, buffer_size); int h2=hash2(value, buffer_size); int i=0; while (arr[h1] != nullptr && i < buffer_size) { if (arr[h1]->value == value && arr[h1]->state) { arr[h1]->state=false; --size; return true; } h1=(h1 + h2) % buffer_size; ++i; } return false; } bool Find(const T& value, const THash1& hash1=THash1(), const THash2& hash2=THash2()) { int h1=hash1(value, buffer_size); int h2=hash2(value, buffer_size); int i=0; while (arr[h1] != nullptr && i < buffer_size) { if (arr[h1]->value == value && arr[h1]->state) return true; h1=(h1 + h2) % buffer_size; ++i; } return false; } }; 
.

Source