Quick search without index
We all have days at work when someone comes to us with a truly impossible requirement, the fulfillment of which requires a miracle. My case occurred when a marketing colleague came up to me with a seemingly simple question: if I could get data from one table for a certain month a couple of years ago. You can say nothing wrong, but I vaguely remembered that the table was very large. The table had a datetime field with the creation time, but was there an index on this field?
Of course, they also wanted to get the data quickly. I, as usual, said: “I will see what I can do” and went to take a closer look at the table under discussion. Good luck will never leave us, the index really did not exist, and the table was huge. Not a problem, we can scan the table, right? Wrong. If I learned something over the years of working with databases, then it is that size matters. A table with hundreds of millions of records, consisting of several integer columns, would be quite formidable. Then add the various varchar and datetime columns. Now this is a real challenge, isn't it?
The table I was looking at had a billion records, literally. A simple table scan could easily take more than a day, and I needed to process the extracted data as well. In addition, scanning a table of this size may not be as favorable for the general state of the system as it seems at first glance. All scanned pages should be extracted from the disks into the sql memory of the server, filling it. This, in turn, will unload data pages from the cache that can be used for current operational tasks. Requests from your current users may wait longer while the server rereads data from disks, rather than quickly reusing data pages from memory. Performance may decline, and in the worst case, the server may be completely on its knees. Thus, when scanning a table, you should use a special technique - run it in small batches, saving somewhere the current position and intermediate results. This approach allows the server to reconfigure and have a respite, avoiding a large performance drawdown.
And so, I thought over the scenario and estimated the time it would take to start and execute the script, when it occurred to me that the datetime values of the creation time were associated with the table identifier. The identifier (ID) was a column with constantly growing values and a primary key based on it. When selecting by identifier, the creation date and time values were naturally pre-sorted in the same way as the identifier values. Wait, I thought, I can realize something extremely fast than scanning a table, something that in the worst case would require only 30 steps instead of 500,000,000! Binary Search!
For everyone who, like me, completed their studies a long time ago, retraining in theory should be in the order of things. Binary search is a simple but extremely efficient way to search for a value in a sorted array (in our case, in a table column). The array must be pre-sorted and finite. The search is performed by comparing the middle element of your array with the target. If they are equal, then the search is over. If not, you delete the half in which the element you are looking for is obviously missing, and repeat the previous step until you have one. Find the middle, compare the target with it, if they are not equal, remove half again and so on. The total number of steps required to find the target in the worst case, when you find it to the very last step, will be log (N), where N is the number of elements. Compare this to N/2 when you just iterate over an array. Generally speaking, this would be N, but if we could guess whether the goal was closer to the end, we could go back, thereby reducing the total number of steps needed.
In my case, N=1,000,000,000 and this is how I came to the two numbers above: 30 and 500,000,000. An identifier (ID) would play the role of an array enumerator, and the creation datetime would be the value of the array element. Although there is one caveat. An array enumerator, by definition, is a continuous sequential sequence of integers.There can easily be spaces in table identifiers due to deleting a record or re-filling an identifier. A simple definition of the middle by dividing the range of identifiers by 2, you should not expect that there will be a record with such an identifier. Instead of directly searching, I had to use the top () function. Something like this:
Select top(1) * from Table where id <= @id order by id desc
I used “& lt; =” and “desc” because I wanted to find a value either equal to or immediately before the goal. Provided that @l, @r is the left and right borders, respectively, id is the middle, @dt is the creation time (creation datetime), tdt is the goal and id r real table identifier (ID), the algorithm may look like this:
while @l <@r begin -- найти середину @id=@l +floor((@r-@l)/2) -- найти запись в таблице select top(1) @idr=id, @dt=creation_datetime from Table where id <= @id order by id desc -- переместить границу if(@dt<@tdt) @l=@id +1 else @r=@id end
The last id r found would be the identifier of the entry after which it was needed. The algorithm has something like a “left” shift, that is, a tendency to choose the leftmost of all values. Since I wanted the record with the value to be as close to the target as possible, I also checked the nearest left and right neighbors in the table to see if one of them was better. Please note that I did not use the real identifier from the table for the iteration process, as in this case, with spaces in the ID values and in some circumstances, the algorithm could go into an infinite loop.
Writing and testing the script took me about an hour. Using it, I found the first record with a specific date and time of creation. After that, I used a simple select statement with a where clause that contained both conditions: id >=@ and creation_datetime >=@ dt1 and creation_datetime & lt; @ dt2. I only needed to make sure that the optimizer would choose to use a primary key instead of an index or a table scan. All in all, I did it in less than 2 hours! It was surprising to discover again that the algorithms are not something esoteric buried deep inside the sql server, but rather something that can be easily used in everyday work.
Below is the whole script that I wrote. Please see the comments inside to understand how to use it.
/* Запустите этот скрипт на вашей бд Он найдет значение, равное или непосредственно перед целью. Обратите внимание, что точность datetime ограничена 3 мс */-- флаг отладки, если установлен, он будет показывать результаты для каждой итерации declare @debug bit=0 -- @Table - имя таблицы, в которой вы хотите искать. declare @Table varchar(128)='TableX' -- @Id - имя столбца идентификатора (id) для таблицы declare @ID varchar(128)='Id' -- @DateTimeColumn - имя вашего datetime столбца со временем создания в таблице declare @DateTimeColumn varchar(128)='created_dt' -- это целевое значение даты и времени declare @TargetDateTime datetime='2020-01-03 18:23:03' declare @Sql varchar(max) set @sql=' -- это ваш отладочный флаг declare @debug bit=<debug> -- это ваше целевое значение declare @tdt datetime=''<TargetDateTime>'' -- в этой переменной мы сохраняем промежуточное значение (результат деления) declare @id bigint -- это ваши левая и правая границы соответственно declare @l bigint, @r bigint -- это переменные, в которых мы храним результаты текущего шага поиска, datetime и идентификатор таблицы соответственно declare @dt datetime, @idr bigint -- найти первые «левые» и «правые» значения select @l=min(<ID>), @r=max(<ID>) from <Table> while @l < @r begin -- ожидаемый идентификатор set @id=@l +floor((@r-@l)/2) -- найти значение и идентификатор записи select top(1) @dt=<DateTimeColumn>, @idr=<ID> from <Table> where <ID> <= @id order by <ID> desc -- если требуется отладка, покажите текущие значения if( @debug=1 ) begin select ''@id''=@id, ''@l''=@l, ''@r''=@r, ''@dt''=@dt, ''@idr''=@idr end if(@dt < @tdt) set @l=@id +1 else set @r=@id end -- проверьте, есть ли у соседних записей лучшее совпадение declare @t table(id bigint, dt datetime, dlt float) insert @t(id, dt, dlt) select top(1) <ID>, <DateTimeColumn>, abs(cast(<DateTimeColumn> as float) -cast(@tdt as float)) from <Table> where <ID> < @idr order by <ID> desc insert @t(id, dt, dlt) select top(1) <ID>, <DateTimeColumn>, abs(cast(<DateTimeColumn> as float) -cast(@tdt as float)) from <Table> where <ID> > @idr order by <ID> insert @t(id, dt, dlt) select @idr, @dt, abs(cast(@dt as float) -cast(@tdt as float)) select top(1) @dt=dt, @idr=id from @t order by dlt, id select ''Found Value''=@dt, ''Record Id''=@idr ' set @sql=replace( replace( replace( replace( replace(@sql, '<ID>', @ID), '<Table>', @Table), '<TargetDateTime>', convert(varchar(50), @TargetDateTime, 121)), '<debug>', @debug), '<DateTimeColumn>', @DateTimeColumn) exec (@sql)