Removing dupes. again

Forum Moderators: open

Message Too Old, No Replies

Removing dupes. again

httpwebwitch

2:59 am on Mar 5, 2009 (gmt 0)

I know this must be a FAQ... Every time I need to do this, I end up looking it up. Why wouldn't SQL DB vendors figure out that dupe removal is a common task, perhaps deserving a native command or utility?

Anyways, I have this table with two int columns.
The data looks like this:

1 3
1 4
2 6
2 7
3 1
3 4
3 6
3 8
...
14225 838827
14225 838828
14225 838892
...

it goes on like that... for over 12 million rows.

Anyways, I don't need duplicates in there. And there are plenty.

Not only are there multiple identical rows like this,
1 2
1 2
1 2

but there are "flipped" rows, which (for my purposes) are identical, like:

1 3
1 3
3 1
3 1

what's the easiest way to dedupe this table?

carguy84

1:37 am on Mar 10, 2009 (gmt 0)

Is there another column to uniquely identify a row?

For future reference, once you have removed all the dupes, you should set a primary key on those columns and avoid the issue in the future if possible.

httpwebwitch

11:59 am on Mar 10, 2009 (gmt 0)

no other column to id the rows. They're just two rows with two ints in them.

BTW I'm using MySQL

carguy84

12:14 am on Mar 11, 2009 (gmt 0)

Hmm, that is a tricky one then as you can't even say:

delete from table where column1 = 10 and column2 = 25 otherwise it will delete the dupe and the original.

I think you might have to select out dupes using GROUP BY, write down which ones have dupes, then manually delete them.

Create a third column as a rowId and then write a small program to loop through the table and remove the duplicates via their rowId.

Personally, I'd choose the second option and then just removed the rowId column and setup the primary key correctly.

piatkow

7:10 am on Mar 12, 2009 (gmt 0)

I have no MySQL experience but having spent a big slice of the last decade as a DBA my generic solution is below.

Clearly the table is incorrectly keyed and indexed.
Create a copy set up correctly
Insert the records, the dupes should simply fail.
Replace the "bad" table

httpwebwitch

12:05 pm on Mar 12, 2009 (gmt 0)

write down which ones have dupes, then manually delete them

I figure to do that with 12 million records, if I could do one every second, working 24h a day without sleep or interruption, I'd be finished in just a little over 9 years. I'd probably also go through a few thousand books of paper and ball-point pens

piatkow's solution seems pretty solid... I'll try it. I wonder how long it'll take to run that one!

isn't there a query I can do using DISTINCT and HAVING? hmm

It perplexes me why isn't this built into the SQL language

carguy84

7:13 pm on Mar 14, 2009 (gmt 0)

It's not built in because it is solved with primary keys. You should be able to insert into a temp table with "select distinct". Copying one table to another that has properly setup primary keys will fail and not resume on failure. You'll have to go the temp table route.