We all know how to find only duplicated rows in a table.
Since SQL Server 2005 this became really simple using a ROW_NUMBER() window function like this:
USE AdventureWorks GO ;WITH cteDupes AS ( -- find all rows that have the same AddressLine1 and City. -- we consider those rows are duplicates so we partition on them SELECT ROW_NUMBER() OVER(PARTITION BY AddressLine1, City ORDER BY AddressID) AS RN, * FROM Person.Address ) SELECT * FROM cteDupes WHERE RN > 1 ORDER BY AddressLine1, City, AddressID
The IO for this method is great. We get only one pass through the table:
Table 'Address'. Scan count 1, logical reads 280, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
However the problem with this code is that it returns only the duplicated rows and not the original row the duplicates originated from.
Good way
We want a fast way of getting the whole duplicated group. My first attempt was to use the Count(*) with OVER to get the group count in each row like this:
USE AdventureWorks GO ;WITH cteDupes AS ( -- find all rows that have the same AddressLine1 and City. -- we consider those rows are duplicates so we partition on them SELECT COUNT(*) OVER(PARTITION BY AddressLine1, City ) AS CNT, * FROM Person.Address ) SELECT * FROM cteDupes WHERE CNT > 1 ORDER BY AddressLine1, City, AddressID
But looking at the IO this caused made my head explode and i wanted a better way.
Table 'Worktable'. Scan count 3, logical reads 117473, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'Address'. Scan count 1, logical reads 280, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Better way
We could get the same thing by using 2 ROW_NUMBER()-ed column, one in ascending and the other on descending order like this:
USE AdventureWorks GO ;WITH cteDupes AS ( -- find all rows that have the same AddressLine1 and City. -- we consider those rows are duplicates so we partition on them SELECT ROW_NUMBER() OVER(PARTITION BY AddressLine1, City ORDER BY AddressID DESC) AS RND, ROW_NUMBER() OVER(PARTITION BY AddressLine1, City ORDER BY AddressID) AS RNA, * FROM Person.Address ) SELECT * FROM cteDupes -- this condition removes the rows that don’t have duplicates -- RNA-RND = 0 gets all odd rows in the group: 1st, 3rd, etc... -- RNA = 1 AND RND = 1 limit those rows to those groups that don’t have duplicates -- because if there’s only one row both RNA and RND will be 1 -- by negating the condition we return the whole duplicated group WHERE NOT (RNA-RND = 0 AND RNA = 1 AND RND = 1)
ORDER BY AddressLine1, City, AddressID
This made the IO a normal single pass at the table.
Table 'Address'. Scan count 1, logical reads 280, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
If anyone has a better way do let me know.