Jesse's SQL Blog: Removing 100% Duplicates

Monday, December 16, 2019

Removing 100% Duplicates

In a table that contains duplicates as identified by some business rule (same name/address, etc.), but the rows included in the sets of duplicates are actually 100% the same, it's impossible to identify a "survivor" of a de-duplication process that deletes all but one of the rows in each set. There is a way in T-SQL to do this, by using the TOP(n) clause after the DELETE. It's kind of unusual, and you should be careful with it. Here's an example:

CREATE TABLE #src (x int, y int, z int)
GO
INSERT #src VALUES (1, 2, 3)
INSERT #src VALUES (1, 2, 3)
INSERT #src VALUES (1, 2, 3)
INSERT #src VALUES (10, 20, 30)
INSERT #src VALUES (100, 200, 300)
GO
SELECT * FROM #src

At this point, #src looks like this (I cannot seem to remove the odd white space in this table):
x y z
1 2 3
1 2 3
1 2 3
10 20 30
100 200 300

Now let's remove our duplicates:
DELETE TOP(2)
FROM #src
WHERE x = 1

SELECT * FROM #src

And we see the results:
x y z
1 2 3
10 20 30
100 200 300

In the DELETE TOP(n) part of the statement, n=#dupes - 1. In our case, we had 3 duplicates and removed 2 of them. Which brings up an important point - we used the "x=1" condition to identify the group, and we knew that it had exactly 3 rows in that dupe group. So if we generalize the problem, we may want to use dynamic SQL to build a solution:

; WITH DupeGroups AS (
SELECT
[x]
,[cnt] = COUNT(*)
FROM #src
GROUP BY [x]
HAVING COUNT(*) > 1
)
SELECT

[sql] = 'DELETE TOP(' + CONVERT(varchar, [cnt] - 1)

+ ') FROM #src WHERE [x] = ' + CONVERT(varchar, [x]) + ';'

FROM DupeGroups

This can then be executed using sp_SqlExec to remove the duplicates.

Jesse's SQL Blog

Monday, December 16, 2019

Removing 100% Duplicates

No comments:

Post a Comment

Followers

Blog Archive

About Me