Interval Queries in SQL Server Part 1

My good old friend Itzik Ben-Gan wrote an excellent article on interval queries in SQL Server (http://sqlmag.com/t-sql/sql-server-interval-queries) by using the Relational Interval Tree model. Based on the model developed by Kriegel, Pötke, and Seidl, and enhanced by Martin, Itzik fully developed a T-SQL solution. The solution is great, and makes interval queries efficient in all circumstances. However, the solution is quite complex. Itzik made a Microsoft Connect feature proposal (https://connect.microsoft.com/SQLServer/feedback/details/780746) to add SQL Server Engine support for interval queries. I fully agree that this would be the best solution; now when the theory is known and implementation made possible, it is time that Microsoft puts this in the database engine.

Unfortunately, it looks like temporal data support is not coming in SQL Server soon, at least not yet in the next version (2014). Therefore, it looks like we have to use our own solutions for a while. Itzik's fantastic article made me think. I was wondering whether there could be some simpler solutions, even if they are not efficient in all circumstances, even if they are applicable for specific cases only. I found three solutions, and in addition, another good friend from SolidQ, Davide Mauri, presented me the fourth one. In this and next three blog posts I am going to present all four solutions. Please note that I am using the tables and data Itzik has prepared. In order to test the solutions, you can download the code from Itzik's article by using the link at the beginning of this post.

Let me start with the classical solution. I am copying the code from Itzik's article for creating and populating the table with 10,000,000 rows, and then for querying the middle of the table.

CREATE TABLE dbo.Intervals

(

id INT NOT NULL,

lower INT NOT NULL,

upper INT NOT NULL,

CONSTRAINT CHK_Intervals_upper_gteq_lower CHECK(upper >= lower)

);

 

INSERT INTO dbo.Intervals WITH(TABLOCK) (id, lower, upper)

SELECT id, lower, upper

FROM dbo.Stage;

 

ALTER TABLE dbo.Intervals ADD CONSTRAINT PK_Intervals PRIMARY KEY(id);

 

CREATE INDEX idx_lower ON dbo.Intervals(lower) INCLUDE(upper);

CREATE INDEX idx_upper ON dbo.Intervals(upper) INCLUDE(lower);

GO

-- ~ 60 seconds needed for creating and populating the table

 

-- query

SET STATISTICS IO ON;

SET STATISTICS TIME ON;

GO

 

-- middle of data

DECLARE @l AS INT = 5000000, @u AS INT = 5000020;

 

SELECT id

FROM dbo.Intervals

WHERE lower <= @u AND upper >= @l

OPTION (RECOMPILE); -- used to allow variable sniffing and prevent plan reuse

-- logical reads: 11254, CPU time: 764 ms

 

As Itzik has shown, the problem with this query is that SQL Server can use only one index, and successfully eliminate rows that are not candidates for the result from one side only, and then scans the rest (approximately half of the) data. In Itzik's and my case presented here, SQL Server used the idx_upper index to eliminate the intervals from the left, which satisfy the condition upper >= @l. Considering this limitation, usage of a single index for a single query, no matter whether it is idx_lower or idx_upper, I started to search for a solution which would use that index for elimination of the rows from both sides. The following figure shows an example of possible intervals.

Black color denotes the intervals in the table. The blue colored interval is the one I am checking for the overlaps. The intervals are sorted by the lower boundary, representing SQL Server's usage of the idx_lower index. Eliminating intervals from the right side of the given (blue) interval is simple: just eliminate all intervals where the beginning is at least one unit bigger (more to the right) of the end of the given interval. You can see this boundary in the figure denoted with yellow line. However, eliminating from the left is more complex. In order to use the same index, the idx_lower index for eliminating from the left, I need to use the beginning of the intervals in the table in the WHERE clause of the query. I have to go to the left side away from the beginning of the given (blue) interval at least for the length of the longest interval in the table, which is marked with red color in the figure. The intervals that begin before the left yellow line cannot overlap with the given (blue) interval.

Of course, the figure could be turned around to cover the cases when the idx_upper index would be more useful. With this index, the elimination from the left is simple – eliminate all of the intervals which end at least one unit before the beginning of the given interval. This time, the elimination from the right is more complex – the end of the intervals in the table cannot be more to the right than the end of the given interval plus maximal length of all intervals in the table.

I created a new table with the same structure as the original Itzik's table, with just one computed column (ilen) added. This column, which calculates the length of an interval, is persisted and indexed in order to speed up the search for the longest interval. The following code creates and populates this table.

CREATE TABLE dbo.IntervalsL

(

id INT NOT NULL,

lower INT NOT NULL,

upper INT NOT NULL,

ilen AS upper - lower PERSISTED,

CONSTRAINT CHK_IntervalsL_upper_gteq_lower CHECK(upper >= lower)

);

 

INSERT INTO dbo.IntervalsL WITH(TABLOCK) (id, lower, upper)

SELECT id, lower, upper

FROM dbo.Stage;

 

ALTER TABLE dbo.IntervalsL ADD CONSTRAINT PK_IntervalsL PRIMARY KEY(id);

 

CREATE INDEX idx_lowerL ON dbo.IntervalsL(lower) INCLUDE(upper);

CREATE INDEX idx_upperL ON dbo.IntervalsL(upper) INCLUDE(lower);

CREATE INDEX ids_ilenL ON dbo.IntervalsL(ilen);

GO

-- ~ 80 seconds needed for creating and populating the table

 

The creating and the population of the table was slightly longer. Of course, this is due to the creation of the additional index. In addition, because of the persisted computed column, the rows are longer and thus SQL Server needed to allocate more pages for the same 10,000,000 rows.

I find the maximal length of the intervals in advance and then store it in a variable. The query that checks for the overlaps in the middle of the data uses just slightly more complex WHERE clause, otherwise it is equal to the original query, as you can see from the following code.

-- query

SET STATISTICS IO ON;

SET STATISTICS TIME ON;

GO

 

-- middle of data

DECLARE @l AS INT = 5000000, @u AS INT = 5000020;

 

DECLARE @max AS INT;

SET @max =

(SELECT MAX(ilen) AS maxlen FROM

(SELECT MAX(ilen) AS ilen FROM dbo.IntervalsL

UNION

SELECT @u - @l) AS m1

);

 

SELECT id

FROM dbo.Intervals

WHERE lower <= @u AND lower >= @l - @max

AND upper >= @l AND upper <= @u + @max

OPTION (RECOMPILE);

GO

-- logical reads: 6 (3 + 3 to calculate the max length), CPU time: 0 ms

 

Note the performance of the query. With Itzik's data, it needed only 6 logical reads, and was thus performing much better that even the fastest solution using the RI tree model, as you can see in the Itzik's article. However, please note that this performance is the consequence of the specific data in the table. The maximal length of an interval is 20, and there are 10,000,000 intervals in the table. This way, SQL Server can very efficiently eliminate intervals from both sides. However, if there would be only one long interval in the table, the code would become much less efficient, because SQL Server would not be able to eliminate a lot of rows from one side, either left or right, depending which index it would use. For example, the following code sets the maximal length of the intervals in the table to a predefined size 100,000, and then executes the same query.

-- Check with longer maximal interval

DECLARE @l AS INT = 5000000, @u AS INT = 5000020;

 

DECLARE @max AS INT = 100000;

 

SELECT id

FROM dbo.Intervals

WHERE lower <= @u AND lower >= @l - @max

AND upper >= @l AND upper <= @u + @max

OPTION (RECOMPILE);

GO

-- logical reads: 228, CPU time: 0 ms

 

The number of logical reads has grown substantially. In worst case, with a very long interval in the table, the performance of the query would be comparable to the performance of the classical solution presented in the beginning of this blog and in the beginning of the Itzik's article.

To conclude, this solution might be very useful in case when your data has more or less uniform distribution of the lengths, or, as in the case with Itzik's test data, when the distribution is not uniform but the maximal length is small. If this is the case, you might prefer this solution to the RI tree model solution, because it is much simpler. However, as I already stated in this blog post, Itzik's solution does not depend on any data distribution. Therefore, if you don't know or control the distribution of your data, you should go for the RI tree model solution.

I will show more solutions in the forthcoming blogs.

Avtor: Anonymous, objavljeno na portalu SloDug.si (Arhiv)

Leave a comment

Please note that we won't show your email to others, or use it for sending unwanted emails. We will only use it to render your Gravatar image and to validate you as a real person.