Saturday, January 12. 2008
Deleting Duplicate Records in a Table
Recommended Books: SQL Cookbook, SQL for Mere Mortals
Comments
Have you checked if this is faster or slower than the form I've seen used many times and have gotten used to:
DELETE FROM tab a WHERE EXISTS (SELECT 1 FROM tab b WHERE a.uniq1=b.uniq1 AND a.uniq2=b.uniq2 AND a.prkey>b.prkey)
Glad you asked. For this particular example we chose not to show that approach since it was considerably slower than the above (so much slower that we didn't bother waiting for it to finish). I suspect it depends on whether you have indexes on the dupe fields and on the ratio of duplicates to non-dupes. This example is odd in that there are more duplicates than actual rows we are keeping, so we may try it on the reverse case, where it may win out.
Just FYI, writing this example with EXISTS would be something like: DELETE FROM duptest a WHERE EXISTS (SELECT 1 FROM duptest b WHERE a.first_name=b.first_name AND a.last_name=b.last_name AND a.name_key < b.name_key)
This method depends on having a unique id. If an auto-numbered key wasn't built into the table, the table's ctid can be used in its place.
Since ctid is a PostgreSQL-ism, some people don't like to use it, but it is an available option in a case like this.
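For reference, a rough sketch of that ctid substitution, assuming the duptest table and the first_name/last_name duplicate columns used elsewhere in this thread. Older releases have no max() aggregate for the tid type, so the ctid is compared as text here (the next comment shows the pre-8.3 textin/tidout spelling of the same cast):

    DELETE FROM duptest
    WHERE ctid::text NOT IN (
        SELECT max(t1.ctid::text)   -- keep one arbitrary physical row per name
        FROM duptest t1
        GROUP BY t1.first_name, t1.last_name
    );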
This post helped a lot. The only problem was type casting: I couldn't run the query without it on pgsql 8.1. Here's the modified version:
DELETE FROM duptest WHERE textin(tidout(ctid)) NOT IN (SELECT max(textin(tidout(t1.ctid))) FROM duptest AS t1 GROUP BY t1.dupid);
In your preferred method, doesn't that also mean the delete query itself requires you to have at least enough hard drive space available to effectively cache your entire database over again?
You mean the entire table, right? If you don't have enough disk space to cache a single table in your database, then you have serious problems anyway.
The approach with the ctid seems to be impractical for tables with a lot of records. I tested it with PostgreSQL 8.2 on a table with 8 million rows and cancelled the statement after 15 hours. For every record the server has to check whether its ctid is outside a set of 8 million minus N ctids, where N is the number of duplicate rows.
In PostgreSQL 8.4 you can do the same more efficiently with window functions in a subselect (count with partition over). I resorted to an approach in Perl: SELECT ctid, first_name, last_name FROM duptest ORDER BY first_name, last_name. Then you iterate over the result set and delete every row whose first_name and last_name equal those of the row before. Since you only need the ctid of the duplicate rows in the WHERE clause of the DELETE statement, you don't need any index for that approach.
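For reference, a rough sketch of the 8.4 window-function variant described above, assuming the same duptest table; row_number() is used here rather than count() over the partition so there is a way to single out the one row to keep:

    DELETE FROM duptest
    WHERE ctid IN (
        SELECT ctid
        FROM (
            SELECT ctid,
                   row_number() OVER (PARTITION BY first_name, last_name) AS rn
            FROM duptest
        ) AS numbered
        WHERE rn > 1   -- every row after the first in its group is a duplicate
    );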
Your query is indeed much faster, but it deletes only one duplicate per entry at a time, so you have to run it repeatedly if you have more than one duplicate.
I tried the SQL code here and it was quite slow on a table with 750,000 records; it kept timing out on me. With a little tweaking I managed to get it down to under a second.
DELETE FROM sometable WHERE uniquefield IN (SELECT max(uniquefield) FROM sometable GROUP BY dupcol1 HAVING count(dupcol1)>1); I am using only one dup column here and haven't tried to make it work with more than one. Simply replace uniquefield with ctid if you must; it will take 2 seconds instead of 1.
I've realized that, using WHERE instead of grouping, I get much better performance when deleting from a large table.
Using the initial example, the query would look like: DELETE FROM duptest d1 WHERE EXISTS (SELECT 'x' FROM duptest d2 WHERE d1.first_name = d2.first_name AND d1.last_name = d2.last_name AND d1.unique_id < d2.unique_id); One would need a unique column (the PK column, perhaps) as the unique_id column, and an index on the first_name and last_name columns would be preferable. On my setup (with 3 grouping columns) it takes 17 seconds to delete around 76,000 duplicate rows from around 1.2 million rows of data. The solution that groups the columns takes much, much longer for me.
I tried both the queries suggested by this blog post, and the variation described in comment #8, on a table with 25 million entries and no unique key (so using ctid).
Both queries took forever (I cancelled them after 6 days each) and used a huge amount of resources (the first one mainly CPU, the second up to 26GB of RAM). In the end, I wrote a simple Perl script which finished in a few hours (using a temporary table to store the unique entries).
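For reference, an SQL-only sketch of that temporary-table idea, with the duptest table and columns assumed as before; it rewrites the whole table, so it needs exclusive access (or at least a maintenance window) while it runs:

    -- collect one copy of each logical row
    CREATE TEMP TABLE duptest_uniq AS
        SELECT DISTINCT ON (first_name, last_name) *
        FROM duptest;

    -- swap the deduplicated rows back in
    TRUNCATE duptest;
    INSERT INTO duptest SELECT * FROM duptest_uniq;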
 name |  phone   | birth_date | balance
------+----------+------------+---------
 a    | 555-8628 | 1988-06-10 |   23.00
 b    | 555-0780 | 1986-12-02 |   25.00
 c    | 555-5898 | 1965-06-14 |   46.00
 d    | 555-5797 | 1961-03-18 |   48.00
 e    | 555-7656 | 1990-09-05 |   21.00
 e    | 555-7656 | 1990-09-05 |   21.00
 e    | 555-7656 | 1990-09-05 |   21.00
 e    | 555-7656 | 1990-09-05 |   21.00
 e    | 555-7656 | 1990-09-05 |   21.00
 e    | 555-7656 | 1990-09-05 |   21.00
 e    | 555-7656 | 1990-09-05 |   21.00
 e    | 555-7656 | 1990-09-05 |   21.00
 e    | 555-7656 | 1990-09-05 |   21.00
 e    | 555-7656 | 1990-09-05 |   21.00
 e    | 555-7656 | 1990-09-05 |   21.00
 e    | 555-7656 | 1990-09-05 |   21.00
 e    | 555-7656 | 1990-09-05 |   21.00
 e    | 555-7656 | 1990-09-05 |   21.00
 e    | 555-7656 | 1990-09-05 |   21.00
 e    | 555-7656 | 1990-09-05 |   21.00
Please let me know how to delete these duplicate rows from this table. I tried the approaches above, but I'm not getting the desired result, only errors.
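Since this table has no unique column at all, a hedged sketch of one way to do it (the table name phonebook is assumed): compare every column in an EXISTS subquery, as in the earlier comments, and use the ctid cast to text as the tie-breaker so exactly one physical copy of each row survives. The ::text cast on ctid needs 8.3 or later; on older versions use the textin/tidout form shown above.

    DELETE FROM phonebook a
    WHERE EXISTS (
        SELECT 1
        FROM phonebook b
        WHERE a.name       = b.name
          AND a.phone      = b.phone
          AND a.birth_date = b.birth_date
          AND a.balance    = b.balance
          AND a.ctid::text < b.ctid::text   -- delete every copy except the "largest" one
    );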