Deleting Duplicate Records in a Table - Postgres OnLine Journal

Question:

How do you delete duplicate rows in a table and still maintain one copy of the duplicate?

Answer:

There are a couple of ways of doing this and approaches vary based on how big your table is, whether you have constraints in place, how programming intensive you want to go, whether you have a surrogate key and whether or not you have the luxury of taking a table down. Approaches vary from using subselects, dropping a table and rebuilding using a distinct query from temp table, and using non-set based approaches such as cursors.

The approach we often use is this one:


DELETE
FROM 	sometable
WHERE 	someuniquekey NOT IN
	(SELECT 	MAX(dup.someuniquekey)
        FROM   		sometable As dup
        GROUP BY 	dup.dupcolumn1, dup.dupcolumn2, dup.dupcolum3)

We prefer this approach for the following reasons

Its the simplest to implement
It works equally well across many relational databases
It does not require you to take a table offline, but of course if you have a foreign key constraint in place, you will need to move the related child records before you can delete the parent.
You don't have to break relationships to do this as you would with drop table approaches

The above presumes you have some sort of unique/primary key such as a serial number (e.g. autonumber, identity) or some character field with a primary or unique key constraint that prevents duplicates. Primary candidates are serial key or OID if you still build your tables WITH OIDs.

If you don't have any of these unique keys, can you still use this technique? In PostgreSQL you can, but in other databases such as SQL Server - you would have to add a dummy key first and then drop it afterward. The reason you can always use this technique in Postgres is because PostgreSQL has another hidden key for every record, and that is the ctid. The ctid field is a field that exists in every PostgreSQL table and is unique for each record in a table and denotes the location of the tuple. Below is a demonstration of using this ctid to delete records. Keep in mind only use the ctid if you have absolutely no other unique identifier to use. A regularly indexed unique identifier will be more efficient.


--Create dummy table with dummy data that has duplicates
CREATE TABLE duptest
(
  first_name character varying(50),
  last_name character varying(50),
  mi character(1),
  name_key serial NOT NULL,
  CONSTRAINT name_key PRIMARY KEY (name_key)
)
WITH (OIDS=FALSE);

INSERT INTO duptest(first_name, last_name, mi)
SELECT chr(65 + mod(f,26)), chr(65 + mod(l,26)), 
CASE WHEN f = (l + 2) THEN chr(65 + mod((l + 2), 26)) ELSE NULL END 
FROM 
	generate_series(1,1000) f
	CROSS JOIN generate_series(1,456) l;
	
--Verify how many unique records we have -
--We have 676 unique sets out of 456,000 records
SELECT first_name, last_name, COUNT(first_name) As totdupes
FROM duptest 
GROUP BY first_name, last_name;
	
--Query returned successfully: 455324 rows affected, 37766 ms execution time.
DELETE FROM duptest
	WHERE 	ctid NOT IN
	(SELECT 	MAX(dt.ctid)
        FROM   		duptest As dt
        GROUP BY 	dt.first_name, dt.last_name);

--Same query but using name_key		
--Query returned successfully: 455324 rows affected, 3297 ms execution time.
DELETE FROM duptest
	WHERE 	name_key NOT IN
	(SELECT 	MAX(dt.name_key)
        FROM   		duptest As dt
        GROUP BY 	dt.first_name, dt.last_name);
		
--Verify we have 676 records in our table
SELECT COUNT(*) FROM duptest;

A slight variation on the above approach is to use a DISTINCT ON query. This one will only work in PostgreSQL since it uses the DISTINCT ON feature of PostgreSQL, but it does have the advantage of allowing you to selectively pick which record to keep based on which has the most information. e.g. in this example we prefer records that have a middle initial vs. ones that do not. The downside of using the DISTINCT ON, is that you really need a real key. You can't use the secret ctid field, but you can use an oid field. Below is the same query but using DISTINCT ON


--Repeat same steps above except using a DISTINCT ON query instead of MAX query
--Query returned successfully: 455324 rows affected, 5422 ms execution time. 
DELETE FROM duptest
	WHERE duptest.name_key
	NOT IN(SELECT DISTINCT ON (dt.first_name, dt.last_name) 
        dt.name_key
       FROM duptest dt 
       ORDER BY dt.first_name, dt.last_name, COALESCE(dt.mi, '') DESC) ;

Note: for the above if you want to selectively pick records say on which ones have the most information, you can change the order by to something like this

ORDER BY dt.first_name, dt.last_name, (CASE WHEN dt.mi > '' THEN 1 ELSE 0 END + CASE WHEN dt.address > '' THEN 1 ELSE 0 END ..etc) DESC