TextRadar fuzzy search algorithm. Index (part 3)
The task of searching for a phrase in a paginated text is reduced to calculating the relevance coefficient for each of the pages and then sorting the list in descending order of the coefficient.
In the calculation process, in accordance with the basic approach, each page is subjected to symbol-by-symbol analysis and here lies the possibility of optimization.
When calculating the coefficient, the groups of matching characters of the search strings and data are analyzed, while the groups can be formed only within the words. On the other hand, considering the problem of the overestimation of the “long” words (in part 2 ), a possible solution was proposed “ calculating the relevance of the search phrase as the average of the relevance of the words included in it, calculated independently ." Using the index allows you to get a result equivalent to this particular approach.
The idea of indexing is not new and consists in the following - due to the fact that the words in the text are repeated, the amount of computation can be reduced. For this, the text in which the search will be carried out, you must first generate an index. In the simplest case, the index is a table of all words of the text, which in addition to words contains data about the pages on which they appear. A fragment of the index table, built on the text of the novel by L.N. Tolstoy's “War and Peace” (in total it contains about 50 thousand words), is shown in the figure.
Directly during the processing of the request, the search string is first broken into words. Next, for each of the words in the search string, relevance is calculated with each of the words index . The calculation results are accumulated in the preliminary results table , which contains the columns “Word of the search string”, “Index word”, “Relevance coefficient”, “Page number”. Only those rows of the index fall into the table whose relevance coefficient for the words exceeded the specified threshold. That is, each index line with a suitable word generates in the preliminary results table rows in an amount equal to the number of pages of text on which this word occurs. Below is a snippet of the table of preliminary results for searching for the phrase “Evening at Anna Pavlovna Scherer.”
Then the table of preliminary results is sorted by the columns Page number , Word of the search string , Coefficient (descending).
Bypassing the rows of the sorted table, for each of the pages and for each word of the search string, the most relevant word of this page is revealed - thanks to the sorting described above, this is the first word for each combination Page Number - Search String Word . The remaining lines in the combination are discarded.
Thus, for each page of text that falls into the table of preliminary results , for each word of the search string, the most relevant words of this page with the corresponding coefficients will be found.The sum of the coefficients of the words found on the page, related to the number of words in the search bar, will give the relevance coefficient for the page.
For example, in the above figure, the search is performed on the line “Evening with Anna Pavlovna Scherer” (we do not take into account the preposition “y”), the lines highlighted in gray are discarded during the crawl. The relevance coefficient for page 1 will be (0.75 + 1 + 0.875 + 1)/4=0.906, for page 2 - 0.906, for 3 - 0.75.
If you do not take into account the time spent on indexing, the results of which are used repeatedly and take into account that the total amount of calculations, which was estimated in hours. 1 , a multiple of the length of the data line:
we can say that the gain from using the index will be a multiple of the ratio:
For example, at the demo stand textradar.ru , the text of the novel "War and Peace" is divided into 1488 pages with 2000 characters each. At the same time, the total number of characters in the words of the index, consisting of 52156 elements, is 434060. That is, the gain from indexing will be about 7:
Due to the fact that the results obtained using indexing are equivalent to the results obtained using one of the previously described approaches without it, it becomes possible to jointly process the search results on the indexed and non-indexed parts of the text. Due to the fact that indexing is a relatively expensive operation and is usually performed according to a schedule, it is possible that some of the text is indexed and some are not yet. In this case, the savings in the amount of calculations will be a multiple of the share of the indexed text in its total size:
In addition to increasing search speed, the use of the index opens up a whole series of possibilities. For example, with the help of statistics obtained during indexing, it is possible to increase the importance of rare words, as well as highlight pages of text on which significant words of a search phrase are more common. You can also expand the index table with synonyms.
This completes the TextRadar series of publications. Thank you all for your interest and valuable comments that allowed us to advance further than originally planned.
The results of applying the described approaches can be seen on the demo stand deployed on the website textradar.ru . The implementation (C # ASP.NET MVC) can be found here .