How do you delete duplicate rows in a table and still maintain one copy of the duplicate?
Answer:There are a couple of ways of doing this and approaches vary based on how big your table is, whether you have constraints in place, how programming intensive you want to go, whether you have a surrogate key and whether or not you have the luxury of taking a table down. Approaches vary from using subselects, dropping a table and rebuilding using a distinct query from temp table, and using non-set based approaches such as cursors.
The approach we often use is this one:
DELETE
FROM sometable
WHERE someuniquekey NOT IN
(SELECT MAX(dup.someuniquekey)
FROM sometable As dup
GROUP BY dup.dupcolumn1, dup.dupcolumn2, dup.dupcolum3)
We prefer this approach for the following reasons
The above presumes you have some sort of unique/primary key such as a serial number (e.g. autonumber, identity) or some character field with a primary or unique key constraint that prevents duplicates. Primary candidates are serial key or OID if you still build your tables WITH OIDs.
If you don't have any of these unique keys, can you still use this technique? In PostgreSQL you can, but in other databases such as SQL Server - you would have to add a dummy key first and then drop it afterward. The reason you can always use this technique in Postgres is because PostgreSQL has another hidden key for every record, and that is the ctid. The ctid field is a field that exists in every PostgreSQL table and is unique for each record in a table and denotes the location of the tuple. Below is a demonstration of using this ctid to delete records. Keep in mind only use the ctid if you have absolutely no other unique identifier to use. A regularly indexed unique identifier will be more efficient.
--Create dummy table with dummy data that has duplicates
CREATE TABLE duptest
(
first_name character varying(50),
last_name character varying(50),
mi character(1),
name_key serial NOT NULL,
CONSTRAINT name_key PRIMARY KEY (name_key)
)
WITH (OIDS=FALSE);
INSERT INTO duptest(first_name, last_name, mi)
SELECT chr(65 + mod(f,26)), chr(65 + mod(l,26)),
CASE WHEN f = (l + 2) THEN chr(65 + mod((l + 2), 26)) ELSE NULL END
FROM
generate_series(1,1000) f
CROSS JOIN generate_series(1,456) l;
--Verify how many unique records we have -
--We have 676 unique sets out of 456,000 records
SELECT first_name, last_name, COUNT(first_name) As totdupes
FROM duptest
GROUP BY first_name, last_name;
--Query returned successfully: 455324 rows affected, 37766 ms execution time.
DELETE FROM duptest
WHERE ctid NOT IN
(SELECT MAX(dt.ctid)
FROM duptest As dt
GROUP BY dt.first_name, dt.last_name);
--Same query but using name_key
--Query returned successfully: 455324 rows affected, 3297 ms execution time.
DELETE FROM duptest
WHERE name_key NOT IN
(SELECT MAX(dt.name_key)
FROM duptest As dt
GROUP BY dt.first_name, dt.last_name);
--Verify we have 676 records in our table
SELECT COUNT(*) FROM duptest;
A slight variation on the above approach is to use a DISTINCT ON query. This one will only work in PostgreSQL since it uses the DISTINCT ON feature of PostgreSQL, but it does have the advantage of allowing you to selectively pick which record to keep based on which has the most information. e.g. in this example we prefer records that have a middle initial vs. ones that do not. The downside of using the DISTINCT ON, is that you really need a real key. You can't use the secret ctid field, but you can use an oid field. Below is the same query but using DISTINCT ON
--Repeat same steps above except using a DISTINCT ON query instead of MAX query
--Query returned successfully: 455324 rows affected, 5422 ms execution time.
DELETE FROM duptest
WHERE duptest.name_key
NOT IN(SELECT DISTINCT ON (dt.first_name, dt.last_name)
dt.name_key
FROM duptest dt
ORDER BY dt.first_name, dt.last_name, COALESCE(dt.mi, '') DESC) ;
Note: for the above if you want to selectively pick records say on which ones have the most information, you can change the order by to something like this
ORDER BY dt.first_name, dt.last_name, (CASE WHEN dt.mi > '' THEN 1 ELSE 0 END + CASE WHEN dt.address > '' THEN 1 ELSE 0 END ..etc) DESC