Massive duplicate data search
CodilX
11:52 am on Feb 23, 2011 (gmt 0)

Hi there,

I have a big database of over 1,000,000 entries, and it's growing.

I want to check for duplicate entries, but every method I've tried either takes way too long or times out.

Is there any way to do this quickly?

My db consists of product rows, each with an image id. Sometimes two or more products get assigned the same image id, and I need to see which products share an image id. I don't need to get this info straight from MySQL; I can do a second search with PHP to find the known duplicate image ids and thus get the product ids.

But the search for duplicates is just taking way too long. Is there a faster way?

I used this simple query, and it times out:

SELECT id, product_img COUNT(*) AS total FROM products GROUP BY product_img HAVING total > 1

My timeout is set to something like 2-3 minutes.

 

coopster
2:31 pm on Feb 23, 2011 (gmt 0)

Can you view the error log? You have a syntax error there: you are missing a comma to separate your product_img column from your COUNT(*) column.

Also, when there is a GROUP BY clause, there are certain rules that apply to grouping columns. One of those rules is "The Single-Value Rule": every column named in the SELECT list must also be a grouping column unless it is an argument of one of the set functions. MySQL extends standard SQL by allowing you to use columns or calculations in the SELECT list that don't appear in the GROUP BY clause. The MySQL docs once warned not to use this feature unless the columns you omit from the GROUP BY clause are unique within the group, because otherwise you will get unpredictable results. Times may have changed, but I believe it is still good practice to write standard SQL queries.
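
To illustrate the rule (the table and column names here are made up purely for the example):

-- MySQL accepts this even though `name` is neither grouped nor aggregated,
-- but the value it returns for `name` within each group is unpredictable:
SELECT category, name, COUNT(*) FROM items GROUP BY category;

-- Standard SQL: every non-aggregated column in the SELECT list
-- is also a grouping column:
SELECT category, name, COUNT(*) FROM items GROUP BY category, name;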

Next, you may need to use the aggregate function in your HAVING clause too, depending on which SQL mode your MySQL server is running in. I updated your query to show you what I mean:
SELECT 
id,
product_img,
COUNT(id) AS total
FROM products
GROUP BY
id,
product_img
HAVING COUNT(id) > 1
;

Finally, if the updated query still times out, try running it in chunks by adding a LIMIT clause.

By the way, this is only going to work if the id/product_img combination is what gets duplicated. Meaning, if your product_img is the "product" you are referring to and it can differ from the identifier in the same record, you are not going to get the results you expect.
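
If what you are really after is just the duplicated image ids themselves (picking up the product ids in your second PHP pass, as you described), a standard-SQL sketch along these lines may be closer to your goal, chunked with LIMIT as mentioned (1000 here is just an example chunk size):

SELECT
product_img,
COUNT(*) AS total
FROM products
GROUP BY product_img
HAVING COUNT(*) > 1
LIMIT 1000
;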

LifeinAsia
4:32 pm on Feb 23, 2011 (gmt 0)

A few questions:
1) Is product_img an INT or VARCHAR field (pointing to an image), or is it the actual image (BLOB field)?
2) Do you have an index on id and product_img?

CodilX
10:22 am on Mar 1, 2011 (gmt 0)

coopster: I just rewrote my query here rather than copying it, so I'm sorry for the error :)

LifeinAsia: the product_img is varchar, because it can contain letters and other symbols.

There is no index on product_img, because it's user-added and simply points to a hashsum of an image, such as 0c3a3889dfdc5f6d5a1500a15b010fac

The main problem with the query I posted above, as I understand it, is that with 1,000,000+ entries, for each row it checks for duplicates it has to go through all 1,000,000+ rows. Right? So to check for duplicates, MySQL has to do on the order of 1,000,000,000,000 comparisons.

So how do I optimize this? :/

LifeinAsia
4:40 pm on Mar 1, 2011 (gmt 0)

There is no index on product_img, because it's user-added and simply points to a hashsum of an image, such as 0c3a3889dfdc5f6d5a1500a15b010fac

If it's the same image, isn't it going to have the same hashsum? If so, put an index on it - being user-added is irrelevant.
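
Something like this (the index name is just an example):

ALTER TABLE products ADD INDEX idx_product_img (product_img);

With that in place, the GROUP BY can walk the index in order instead of comparing every row against every other row.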

CodilX
8:37 am on Mar 2, 2011 (gmt 0)

The biggest problem is that there can't be a check when adding products. The issue is that someone adds a product list of about 100,000 products. To ensure the information is added quickly, I modify the import file after upload and use MySQL LOAD DATA LOCAL INFILE. Importing that data row by row and checking everything during the import takes too long, but with LOAD DATA it takes literally seconds. So now I have to do the duplicate check afterwards instead.
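
The import itself is roughly along these lines (file path, delimiters, and column list here are placeholders, not my real ones):

LOAD DATA LOCAL INFILE '/tmp/products.csv'
INTO TABLE products
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(id, product_img);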

LifeinAsia
5:01 pm on Mar 2, 2011 (gmt 0)

OK, but that doesn't preclude the use of an index.

CodilX
7:24 am on Mar 7, 2011 (gmt 0)

So you suggest that I use an index on the product_img?

LifeinAsia
4:41 pm on Mar 7, 2011 (gmt 0)

Try it. If it doesn't help (which I doubt), you can always remove it.

brotherhood of LAN
4:48 pm on Mar 7, 2011 (gmt 0)

Yes, an index would help.

0c3a3889dfdc5f6d5a1500a15b010fac


Looks like an MD5 value. Hopefully you're storing these as 16-byte binary rather than 32 characters, which will help when querying the newly created index.
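
For example, assuming the hash lives in a BINARY(16) column (product_img_bin is just a name for the sake of the sketch):

-- pack the 32-char hex string into 16 bytes on the way in
INSERT INTO products (id, product_img_bin)
VALUES (1, UNHEX('0c3a3889dfdc5f6d5a1500a15b010fac'));

-- unpack it for display on the way out
SELECT id, HEX(product_img_bin) FROM products;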

CodilX
8:45 am on Mar 16, 2011 (gmt 0)

Well, the problem that comes up is that I already have duplicate values, so when I try to add a UNIQUE index, it dies with an error showing that I have a duplicate entry.

brotherhood of LAN
8:48 am on Mar 16, 2011 (gmt 0)

Try using ALTER IGNORE TABLE; this will delete duplicates and create the unique key. Note that it will delete the duplicate rows.

[dev.mysql.com...]
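
Something like this (untested against your schema, and remember the warning above about deleted rows):

-- keeps one row for each product_img value and silently drops the rest
ALTER IGNORE TABLE products ADD UNIQUE INDEX idx_product_img (product_img);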

CodilX
9:25 am on Mar 16, 2011 (gmt 0)

Still spits out the same error..

LifeinAsia
3:29 pm on Mar 16, 2011 (gmt 0)

when I try to add a UNIQUE index, it dies with an error

So don't make the index unique. If you know there are duplicates to begin with, trying to make a unique index is just asking for errors.

TheMadScientist
4:56 pm on Mar 24, 2011 (gmt 0)

COUNT(id)

1.) Add a non-unique index as suggested.

2.) MySQL counts are optimized for COUNT(*) if I remember correctly (double-check this), and I think I remember reading that counting indexed data is better done with COUNT(*) than with a count over a specific column. (I don't remember all the ins and outs, but I know that for certain count-related queries it is - or was - better to use the * than something more specific, because of the way an index is created.)

3.) Chunk if necessary, as coopster suggests.

I've looped through text before with a low timeout to work with, so I set the loop so it would reliably complete in slightly less than the timeout limit, put it on an HTML page with a meta refresh that fired just under the timeout, opened it in a browser window, and let it run.

That cut manual refreshes down from one every timeout to one every 30 minutes or so.

brotherhood of LAN
5:31 pm on Mar 24, 2011 (gmt 0)

2.) MySQL counts are optimized for COUNT(*) if I remember correctly (double-check this), and I think I remember reading that counting indexed data is better done with COUNT(*) than with a count over a specific column. (I don't remember all the ins and outs, but I know that for certain count-related queries it is - or was - better to use the * than something more specific, because of the way an index is created.)


MyISAM tables are optimized for COUNT(*) with no WHERE clause because they store the row count as metadata; the query appears instant because it doesn't actually scan any indexes or data. Transactional engines like InnoDB are not 'optimized' for COUNT(*) in the same way because of their transactional nature: the table can be in more than one state at any one time.

Anyway, assuming it is a MyISAM table, any WHERE clause on a COUNT() will cause an index or table scan, so an index is preferable.
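
To illustrate:

-- MyISAM answers this from the stored row count, with no scan at all:
SELECT COUNT(*) FROM products;

-- any WHERE clause forces a scan, which is where the index earns its keep:
SELECT COUNT(*) FROM products
WHERE product_img = '0c3a3889dfdc5f6d5a1500a15b010fac';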

TheMadScientist
7:43 pm on Mar 24, 2011 (gmt 0)

Thanks. It's been a long time since I read all that stuff and it's filed away somewhere, so I only remember bits and pieces at times ... Out of habit I always index anything I will be selecting by, and I really don't understand why people wouldn't. I guess at times you'll hit an index size limit, but for the most part, imo, if you're using a column in the WHERE clause you should find a way to index it ... Am I missing something?
