SQL Server - How to get the whole group of duplicate rows

We all know how to find only duplicated rows in a table.

Since SQL Server 2005 this became really simple using a ROW_NUMBER() window function like this:

USE AdventureWorks
GO
;WITH cteDupes AS 
(
    -- find all rows that have the same AddressLine1 and City. 
    -- we consider those rows are duplicates so we partition on them    
    SELECT  ROW_NUMBER() OVER(PARTITION BY AddressLine1, City ORDER BY AddressID) AS RN,
            *
    FROM    Person.Address
)
SELECT  *
FROM    cteDupes
WHERE   RN > 1
ORDER BY AddressLine1, City, AddressID
 

The IO for this method is great. We get only one pass through the table:

Table 'Address'. Scan count 1, logical reads 280, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
 

However the problem with this code is that it returns only the duplicated rows and not the original row the duplicates originated from.

Good way

We want a fast way of getting the whole duplicated group. My first attempt was to use the Count(*) with OVER to get the group count in each row like this:

USE AdventureWorks
GO
;WITH cteDupes AS 
(
    -- find all rows that have the same AddressLine1 and City. 
    -- we consider those rows are duplicates so we partition on them
    SELECT  COUNT(*) OVER(PARTITION BY AddressLine1, City ) AS CNT,
            *
    FROM    Person.Address
)
SELECT  *
FROM    cteDupes
WHERE   CNT > 1
ORDER BY AddressLine1, City, AddressID
 

But looking at the IO this caused made my head explode and i wanted a better way.

Table 'Worktable'. Scan count 3, logical reads 117473, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Address'. Scan count 1, logical reads 280, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Better way

We could get the same thing by using 2 ROW_NUMBER()-ed column, one in ascending and the other on descending order like this:

USE AdventureWorks
GO
;WITH cteDupes AS 
(
    -- find all rows that have the same AddressLine1 and City. 
    -- we consider those rows are duplicates so we partition on them
    SELECT  ROW_NUMBER() OVER(PARTITION BY AddressLine1, City ORDER BY AddressID DESC) AS RND,
            ROW_NUMBER() OVER(PARTITION BY AddressLine1, City ORDER BY AddressID)      AS RNA,
            *
    FROM    Person.Address
)
SELECT  *
FROM    cteDupes
-- this condition removes the rows that don’t have duplicates 
-- RNA-RND = 0 gets all odd rows in the group: 1st, 3rd, etc...
-- RNA = 1 AND RND = 1 limit those rows to those groups that don’t have duplicates 
-- because if there’s only one row both RNA and RND will be 1
-- by negating the condition we return the whole duplicated group
WHERE NOT (RNA-RND = 0 AND RNA = 1 AND RND = 1)
ORDER BY AddressLine1, City, AddressID
 

This made the IO a normal single pass at the table.

Table 'Address'. Scan count 1, logical reads 280, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
 

 

If anyone has a better way do let me know.

kick it on DotNetKicks.com


Avtor: Anonymous, objavljeno na portalu SloDug.si (Arhiv)

Leave a comment

Please note that we won't show your email to others, or use it for sending unwanted emails. We will only use it to render your Gravatar image and to validate you as a real person.