Interval Queries in SQL Server Part 5

This is the fifth part of the solutions for interval queries in SQL Server. I know I wrote in the first one that there would be four solutions; however, I found out another one while writing posts about other four. For an introduction, please refer to the blog post Interval Queries in SQL Server Part 1 (http://blogs.solidq.com/dsarka/Post.aspx?ID=149&title=Interval+Queries+in+SQL+Server+Part+1). You can find the second part of the solutions in the blog post Interval Queries in SQL Server Part 2 (http://blogs.solidq.com/dsarka/Post.aspx?ID=150&title=Interval+Queries+in+SQL+Server+Part+2), the third part in the blog post Interval Queries in SQL Server Part 3 (http://blogs.solidq.com/dsarka/Post.aspx?ID=151&title=Interval+Queries+in+SQL+Server+Part+3), and the fourth part in the blog post Interval Queries in SQL Server Part 4 (http://blogs.solidq.com/dsarka/Post.aspx?ID=152&title=Interval+Queries+in+SQL+Server+Part+4). Note that you also need to read an excellent article by Itzik Ben-Gan wrote on interval queries in SQL Server (http://sqlmag.com/t-sql/sql-server-interval-queries) by using the Relational Interval Tree model. I am using the tables and data Itzik has prepared. In order to test the solutions, you can download the code from Itzik's article by using the link in this paragraph.

I described the solution that uses the unpacked form of the intervals in part 3. The solution is very good from the querying perspective, the performance is great, and it does not depend on the data distribution, specifically does not depend on length distribution of the intervals, like the solutions I presented in parts 1 and 4 do. However, the unpacked intervals solution can become a nightmare from the maintenance perspective, because the number of rows in the unpacked version of the table or indexed view could become enormous.

I decided to rethink the unpacked intervals solution. Do I really need all unit intervals, i.e. all points from an interval, stored in a table or indexed view? The main goal is that the WHERE clause of the query that searches for the overlapping intervals with a given interval has to be supported by a single index. This way, SQL Server Query Optimizer can find an efficient plan. I realized that I actually need only the boundaries of an interval, the lower and the upper columns, stored in a single column. Of course, this means that I need only two rows per interval. Please note: two rows only, no matter of the length of the interval. I realized that instead of using the unpacked form I can use the unpivoted form. Remember the table that represents the traditional solution, which is presented in part 1 of this series? Here is the code, just to remind you; I am not showing the code that populates this table here.

CREATE TABLE dbo.Intervals
(

id INT NOT NULL,
lower INT NOT NULL,
upper INT NOT NULL,
CONSTRAINT CHK_Intervals_upper_gteq_lower CHECK(upper >= lower)
);

 

I don't want to change the structure of this table, because this is the correct representation of the intervals. Therefore, I have to create a redundant table to store the unpivoted intervals.

CREATE TABLE dbo.Intervals_Unpivoted_Table
(

id INT NOT NULL,
pointtype CHAR(1) NOT NULL,
point INT NOT NULL,
);
GO

ALTER TABLE dbo.Intervals_Unpivoted_Table
ADD CONSTRAINT PK_Intervals_Unpivoted_Table

PRIMARY KEY(point, pointtype, id);
GO

 

For the overlapping intervals query, all I would need would be the id and the point columns. However, for other type of queries, and in order to have a key, I also added the pointtype column. This column will hold a single letter 'b' denoting that this is the beginning point of an interval or letter 'e' denoting that this is the end of an interval. Columns id and point are not enough to enforce uniqueness; in a unit interval, the lower and the upper boundaries are the same, and thus I would get a duplicate key. Note also the order of the columns in the key. The point column is in the first position, because this is going to be the most useful column for seeks. Let's now populate the table. I am not too big fan of the UNPIVOT operator; that's why I am doing the unpivoting manually, with help of the CROSS JOIN operator and the CASE expression.

INSERT INTO dbo.Intervals_Unpivoted_Table
WITH(TABLOCK) (id, pointtype, point)
SELECT id, pointtype,
CASE WHEN pointtype =
'b' THEN
lower ELSE upper END AS point
FROM dbo.Intervals CROSS JOIN
(SELECT 'b' UNION SELECT 'e') AS be(pointtype);
GO

 

It took less than 40 seconds to insert the data on my notebook. Therefore, the insert is very efficient. Of course, I am inserting 20,000,000 rows only, and not over 100,000,000 rows, as with the unpacked intervals solution. Of course, in a production system, you would need to maintain this redundant dbo.Intervals_Unpivoted_Table table. You would need to create DML triggers on the source dbo.Intervals table. Now let's check the performance of the query that searches for overlapping intervals in the middle of the data.

-- query
SET STATISTICS IO ON;
SET
STATISTICS TIME ON;
GO


-- middle of data
DECLARE @l AS INT = 5000000, @u AS INT = 5000020;

SELECT DISTINCT id
FROM dbo.Intervals_Unpivoted_Table
WHERE point <=
@u

AND point >= @l
OPTION (RECOMPILE);
GO

-- logical reads: 3, CPU time: 0 ms

 

You can see that the query performs really excellent. I will leave to you to check how the query performs when you are searching for the overlapping intervals at the beginning or at the end of the data. I tested this, and I get three logical reads constantly.

Can I maintain this redundant table without the DML triggers? You can imagine that the answer is yes, otherwise I would not write this questionJ This is possible if I replace the table with an indexed view. I am using the insert query for the view definition. However, I need to change the query definition, because indexed views still do not allow derived tables. Therefore, instead of the derived table, I create a permanent table called dbo.Matrix and populate it with two rows. Please check carefully the values I am inserting.

CREATE TABLE dbo.Matrix
(l INT NOT
NULL,
u INT NOT NULL,
pointtype CHAR(1) NOT NULL);
GO


INSERT INTO dbo.Matrix (l, u, pointtype)
VALUES (1, 0, 'b'), (0, 1, 'e');

 

I also decided to change the calculation of the values for the point column. Instead of the CASE expression, with help of the dbo.Matrix table, I am using simple arithmetic to calculate it. I do the calculation in the body of the view I am indexing. The following code shows the view creation and population.

CREATE VIEW dbo.Intervals_Unpivoted_View
WITH SCHEMABINDING
AS
SELECT id, pointtype,
lower *
l
+ upper * u AS point
FROM dbo.Intervals
CROSS JOIN dbo.Matrix;
GO

-- Index the view
CREATE UNIQUE CLUSTERED INDEX PK_Intervals_Unpivoted_View
ON dbo.Intervals_Unpivoted_View (point, pointtype, id);
GO

 

It took also only about 40 seconds to populate the view. Therefore, the view is completely maintainable, even if you need to re-index or recreate it.

-- query
SET STATISTICS IO ON;
SET
STATISTICS TIME ON;
GO


-- middle of data
DECLARE @l AS INT = 5000000, @u AS INT = 5000020;

SELECT DISTINCT id
FROM dbo.Intervals_Unpivoted_View (NOEXPAND)
WHERE
point <=
@u

AND
point
>= @l
OPTION (RECOMPILE);
GO

-- logical reads: 3, CPU time: 0 ms

 

As you can see, the performance is still fantastic. This time I used the NOEXPAND table option to prevent SQL Server from expanding the view to the base table. Anyway, it looks like this solution is really good. The query performance is excellent, the maintenance is more than reasonable, and the queries are really simple.

Avtor: Anonymous, objavljeno na portalu SloDug.si (Arhiv)

Leave a comment

Please note that we won't show your email to others, or use it for sending unwanted emails. We will only use it to render your Gravatar image and to validate you as a real person.