Interval Queries in SQL Server Part 3

This is the third part of the solutions for interval queries in SQL Server. For an introduction, please refer to the blog post Interval Queries in SQL Server Part 1 (http://blogs.solidq.com/dsarka/Post.aspx?ID=149&title=Interval+Queries+in+SQL+Server+Part+1). You can find the second part of the solutions in the blog post Interval Queries in SQL Server Part 2 (http://blogs.solidq.com/dsarka/Post.aspx?ID=150&title=Interval+Queries+in+SQL+Server+Part+2). Note that you also need to read an excellent article by Itzik Ben-Gan wrote on interval queries in SQL Server (http://sqlmag.com/t-sql/sql-server-interval-queries) by using the Relational Interval Tree model. I am using the tables and data Itzik has prepared. In order to test the solutions, you can download the code from Itzik's article by using the link in this paragraph.

Let me start with an introduction to the unpacked form of intervals. In order to make this introduction complete, I am also briefly introducing the packed form of intervals, although I do not need this form for querying for overlapping intervals. For more in-depth knowledge about intervals, interval operators, and interval forms, please refer to the Inside Microsoft® SQL Server® 2008: T-SQL Programming book by Itzik Ben-Gan , Dejan Sarka, Roger Wolter, Greg Low , Ed Katibah, and Isaac Kunen, Microsoft Press, 2009 (http://www.amazon.com/gp/product/0735626022/ref=s9_simh_gw_p14_d0_i1?pf_rd_m=ATVPDKIKX0DER&pf_rd_s=center-2&pf_rd_r=1AEWTQ7GTY1XS1R68R2M&pf_rd_t=101&pf_rd_p=1389517282&pf_rd_i=507846). I explain a lot of details in chapter 12, Temporal Support in the Relational Model.

Any set of points in time can be represented by a collection of intervals, which I'll call an interval set. Look at the following three interval sets:

  • {(2:5), (3:7), (10:12)}
  • {(2:2), (3:3), (4:4), (5:5), (6:6), (7:7), (10:10), (11:11), (12:12)}
  • {(2:7), (10:12)}

Each set represents the same set of time points. The first set contains overlapping intervals. The second set contains only intervals exactly one time point long, or unit intervals. The third set represents the time points using the smallest possible number of intervals.

An interval set is in expanded form (or is unpacked) if it contains unit intervals only. An interval set is in collapsed form (or is packed) if it contains no intervals that overlap or meet. Any set of points in time can be represented uniquely in either form, and you can require one or the other form without losing anything.

I can transform a relation that has an attribute—such as during—of interval data type into another relation similarly to the way I rearranged sets of intervals. Without further delay, I'll define the unpacked form of a relation: a relation with full temporal support is unpacked if every value of interval type is a unit interval.

It would be very nice if T-SQL supported the UNPACK operator. However, there is no such operator in T-SQL yet. So how can you get an unpacked form of a relation? Again, the numbers auxiliary table or tabular function get handy here. I can join the original table with intervals with the auxiliary numbers table using a non-equijoin, using the BETWEEN operator to find all time points between the beginning and end of an interval, or simply use the APPLY operator to apply the auxiliary numbers tabular function to each row of the intervals table. From the time points, I can create unit intervals.

My first solution uses an indexed view. The APPLY operator is still not supported in an indexed view; therefore, to get the unpacked form, I have to rely on the JOIN operator. This means that instead of the tabular function, I have to create and populate an auxiliary table of numbers, which the following code does.

CREATE TABLE dbo.Nums
(n INT NOT NULL PRIMARY KEY);
GO


DECLARE @max
AS INT, @rc AS INT, @d AS DATE;
SET
@max = 10000000;
SET @rc = 1;

INSERT INTO dbo.Nums VALUES(1);
WHILE
@rc *
2 <= @max

BEGIN
INSERT INTO dbo.Nums
SELECT n + @rc FROM dbo.Nums;
SET @rc =
@rc
* 2;
END

INSERT INTO dbo.Nums
SELECT n
+ @rc
FROM dbo.Nums
WHERE n + @rc <= @max;
GO

 

Here is the code that creates a new table with intervals, a view in the unpacked form on this table, and then indexes the view to persist the unpacked form.

CREATE TABLE dbo.IntervalsU
(

id INT NOT NULL,
lower INT NOT NULL,
upper INT NOT NULL,
CONSTRAINT CHK_IntervalsU_upper_gteq_lower CHECK(upper >= lower)
);
GO

ALTER
TABLE dbo.IntervalsU ADD CONSTRAINT PK_IntervalsU PRIMARY KEY(id);
GO

-- Create view Intervals_Unpacked
IF OBJECT_ID('dbo.Intervals_Unpacked', 'V') IS NOT NULL
DROP VIEW dbo.Intervals_Unpacked;
GO
CREATE VIEW dbo.Intervals_Unpacked
WITH SCHEMABINDING
AS
SELECT i.id, n.n
FROM dbo.IntervalsU AS i
INNER JOIN dbo.Nums AS n
ON n.n BETWEEN i.lower AND i.upper;
GO
-- Index the view
CREATE UNIQUE CLUSTERED INDEX PK_Intervals_Unpacked
ON dbo.Intervals_Unpacked(n, id);
GO

 

Note that in the composite index on the view, I put the number n in the first place. This is what I am going to seek for in my queries. Let's insert the data in the new table, and SQL Server will insert also the data in the view.

-- SLOW! (17+ min!)
INSERT INTO dbo.IntervalsU WITH(TABLOCK) (id, lower, upper)
SELECT id, lower, upper
FROM dbo.Stage;
GO

 

You can see from the comment that the indexed view slowed down the insert substantially. Still, the performance is acceptable. However, when I tried to insert the data in the table first, and then create the index on the view, the performance was not really acceptable. At least not for me – after two hours I lost my patience and cancelled the index creation. This is not a negligible issue; imagine what would happen if you would need to re-index the view in production.

Time for querying the table. I am showing here the query that selects the same rows from the middle of the data as the query in my previous articles and as the queries in Itzik's article.

-- query
SET STATISTICS IO ON;
SET
STATISTICS TIME ON;
GO


-- middle of data
DECLARE @l AS INT = 5000000, @u AS INT = 5000020;
DECLARE @t AS TABLE(id INT);
INSERT
INTO @t(id)
SELECT id

FROM dbo.Intervals_Unpacked
WHERE n BETWEEN
@l
AND
@u;
SELECT DISTINCT id

FROM @t;
-- logical reads: 6 (5 + 1 from the table variable), CPU time: 0 ms
GO

 

I use a table variable to temporarily store the id column of the intervals that do include at least one point of the given interval, the interval I am checking for overlapping. The logic is simple – if an interval from the table includes at least one point that the given interval includes, than the two intervals overlap. I decided to store these id values in a table variable and then perform the DISTINCT operator on them in order to prevent SQL Server to expand the view to the base table and make the query less efficient. I could prevent this also with a query hint.

As you can see from the comment, the query is very efficient. You get the same efficiency also if you query the beginning or the end of the data. In addition, the query performance does not depend on the length distribution of the intervals. From the querying point of the view, this is practically an ideal solution. The queries are fast and simple, much simpler than the queries on the RI tree model that Itzik explained in his article.

Of course, you already know there is no free lunch. The drawback of this solution is the same as its advantage – the indexed view. Maintenance of an indexed view is a nightmare. In addition, there are many, many more rows in the unpacked form than in the original form. With this data, the original form has 10,000,000 rows, with maximal length of an interval 20, and average length around 10. I got more than 100,000,000 rows in the unpacked form. With longer intervals, there could potentially be tens of times more rows in the unpacked form.

<rant>

Many of the problems could be mitigated if Microsoft would spend some time and improve the indexed views. However, like many times with Microsoft, it seems that they are simply forgetting on some features, and leave them half-baked. Then they are surprised when we don't use these features in production. And instead of finalizing these features, they are simply adding new features in the next version, again half-baked. Like children that get bored with a toy very quickly and want a new toy much too soon.

</rant>

Since SQL Server can't help me here, I decided to create a table in the unpacked form. Re-indexing a table is not a problem, and therefore the maintenance is not such a nightmare. In addition, I can use data compression to save the space. However, SQL Server does not automatically update my table. In order to get it updated automatically, you need to create DML triggers on the original table. Below is the code that created and populates a table in the unpacked form.

CREATE TABLE dbo.Intervals_Unpacked_Table
(

id INT NOT NULL,
n INT NOT NULL
)
WITH (DATA_COMPRESSION = ROW);
GO


ALTER TABLE dbo.Intervals_Unpacked_Table
ADD CONSTRAINT PK_Intervals_Unpacked_Table
PRIMARY KEY(n, id);
GO

-- 4 min with row compression
INSERT INTO dbo.Intervals_Unpacked_Table WITH(TABLOCK) (id, n)
SELECT id, n

FROM dbo.Intervals
CROSS APPLY dbo.GetNums(lower, upper);
GO

 

The population of this table was reasonably fast. Of course, there are still more than 100,000,000 rows in this table. What is the space consumption?

EXEC sys.sp_spaceused N'dbo.Intervals', TRUE;
EXEC sys.sp_spaceused N'dbo.Intervals_Unpacked', TRUE;
EXEC sys.sp_spaceused N'dbo.Intervals_Unpacked_Table', TRUE;
GO
-- Original table reserverd:                 566,424 KB
-- Indexed view reserved:                1,857,808 KB
-- Table with row compressions reserved:        1,234,000 KB

 

From the space consumption point of view, this is not problematic. The unpacked table occupies a bit more than twice as much space as the original table. However, remember that the original table does not have any additional attribute besides the interval boundaries and the key. In production, a real table would have many more columns, and thus much longer rows, and the space consumption would be probably even bigger that the space consumption of the unpacked table. I also tried to use the page compression; however, I did not get much better compression rate with this data set and decided to stick with the row compression. So let's query the unpacked table!

-- query
SET STATISTICS IO ON;
SET
STATISTICS TIME ON;
GO


-- middle of data
DECLARE @l AS INT = 5000000, @u AS INT = 5000020;
SELECT DISTINCT id
FROM dbo.Intervals_Unpacked_Table
WHERE n BETWEEN
@l
AND @u;
GO
-- logical reads: 7, CPU time: 0 ms

 

The query is very efficient again, and the performance also does not depend on the interval length distribution. However, with longer intervals, you could get many more rows, and still have issues with the maintenance. Still, I like this solution quite a lot, because the queries are simple, efficient, and have stable performance regarding the data distribution.

Avtor: Anonymous, objavljeno na portalu SloDug.si (Arhiv)

Leave a comment

Please note that we won't show your email to others, or use it for sending unwanted emails. We will only use it to render your Gravatar image and to validate you as a real person.

where can i buy cheap cytotec tablets
where can i buy cheap cytotec tablets - sreda, 27. november 2024

Anone else feel tired all the time <a href=https://cytotec2buy.top/>buying cytotec price</a> Giltay None

HedaWhece
HedaWhece - petek, 22. november 2024

El Hachimi K, Benjelloun H, Zaghba N, Yassine N <a href=https://fastpriligy.top/>buy priligy cheap</a>

HedaWhece
HedaWhece - četrtek, 07. november 2024

The following antibodies were incubated overnight at 4 C Unc5B Cell Signaling, 13851S, dilution 1 1000, Robo4 Invitrogen, 20221 1 AP, dilution 1 300, Flrt2 Novus bio, NBP2 43653, dilution 1 500, Netrin 1 R D, AF1109, dilution 1 500, Claudin 5 Invitrogen, 35 2500, dilution 1 1000, PLVAP BD biosciences, 550563, dilution 1 200, PDGFRОІ Cell Signaling, 3169S, dilution 1 1000, GFAP Millipore, MAB360, dilution 1 1000, VEGFR2 Y949 Cell Signaling, 4991S, dilution 1 500, VEGFR2 Y1173 Cell Signaling, 2478S, dilution 1 500, pLRP6 Cell Signaling, 2568S, dilution 1 300, LRP6 Cell Signaling, 3395S, dilution 1 500, ОІcatenin Cell Signaling, 8480S, dilution 1 2000, LEF1 Cell Signaling, 2230S, dilution 1 1000, ZO1 Invitrogen, 61 7300, dilution 1 2000, Occludin Invitrogen, 33 1500, dilution 1 1000, Caveolin 1 Cell Signaling, 3238S, dilution 1 2000, VE cadherin BD Pharmingen, 555289, dilution 1 500 and ОІactin Sigma, A1978, dilution 1 5000 <a href=https://fastpriligy.top/>buy priligy usa</a>