| 2:31 pm on Feb 23, 2011 (gmt 0)|
Can you view the error log? Because you have a syntax error there. You are missing a comma to separate your product_img column from your COUNT(*) column.
Also, when there is a GROUP BY clause, there are certain rules that apply for grouping columns. One of those rules is "The Single-Value Rule" -- every column named in the SELECT list must also be a grouping column unless it is an argument for one of the set functions. MySQL extends standard SQL by allowing you to use columns or calculations in a SELECT list that don't appear in a GROUP BY clause. The MySQL docs once warned not to use this feature unless the columns you omit from the GROUP BY clause are not unique in the group because you will get unpredictable results. Times may have changed, but I believe it is still good practice to write standard SQL queries.
Next, you may need to use the aggregate function in your HAVING clause too, depending on what mode your MySQL server is running. I updated your query to show you what I mean:
COUNT(id) AS total
HAVING COUNT(id) > 1
Finally, if the updated query still times out, try running it in chunks by adding a LIMIT clause.
By the way, this is only going to work if the combination of id/product_img are duplicates. Meaning, if your product_img is the "product" to which you are referring and it may be different than the identifier in the same record, you are not going to get your expected results.
| 4:32 pm on Feb 23, 2011 (gmt 0)|
A few questions:
1) Is product_img an INT or VARCHAR field (pointing to an image), or is it the actual image (BLOB field)?
2) Do you have an index on id and product_img?
| 10:22 am on Mar 1, 2011 (gmt 0)|
coopster: sorry, I just rewrote my query here, not copied - so I'm sorry for the error :)
LifeinAsia: the product_img is varchar, because it can contain letters and other symbols.
There is no index for product_img, because its user added and simply points to a hashsum of an image, such as 0c3a3889dfdc5f6d5a1500a15b010fac
The main problem, as I understand with the query I posted above, is that since there are 1 000 000 + entries, every time it goes through 1 entry to check for duplicates, it goes through all of the 1 000 000+ rows. Right? So to check for duplicates, MySQL has to do 1 000 000 000 000 runs through the database.
So how do I make this optimized? :/
| 4:40 pm on Mar 1, 2011 (gmt 0)|
|There is no index for product_img, because its user added and simply points to a hashsum of an image, such as 0c3a3889dfdc5f6d5a1500a15b010fac |
If it's the same image, isn't it going to have the same hashsum? Is so, put an index on it- being user added is irrelevant.
| 8:37 am on Mar 2, 2011 (gmt 0)|
The biggest problem is, that there can't be a check when adding products. The issue is, that someone adds a product list of about 100 000 products. To insure that the information is added quickly, I use MySQL LOAD DATA LOCAL INFILE after file upload, and modifying the import file. It takes too long importing that data and checking everything while doing the import, but when using the load data it takes literally seconds. So now I have to reverse the check for duplicates.
| 5:01 pm on Mar 2, 2011 (gmt 0)|
OK, but that doesn't preclude the use of an index.
| 7:24 am on Mar 7, 2011 (gmt 0)|
So you suggest that I use an index on the product_img?
| 4:41 pm on Mar 7, 2011 (gmt 0)|
Try it. If it doesn't help (which I doubt), you can always remove it.
|brotherhood of LAN|
| 4:48 pm on Mar 7, 2011 (gmt 0)|
Yes an index would help
Looks like an MD5 value, hopefully you're storing these as 16 byte binary rather than 32 char, which will be helpful when querying the newly created index.
| 8:45 am on Mar 16, 2011 (gmt 0)|
Well the problem that comes up, is that I already have duplicate values, and when I try to add a Unique index, it dies with an error that shows that I have a duplicate entry.
|brotherhood of LAN|
| 8:48 am on Mar 16, 2011 (gmt 0)|
Trying using ALTER IGNORE TABLE , this will delete duplicates and create the unique key. Note that it will delete duplicates.
| 9:25 am on Mar 16, 2011 (gmt 0)|
Still spits out the same error..
| 3:29 pm on Mar 16, 2011 (gmt 0)|
|when I try to add a Unique index, it dies with an error |
So don't make the index unique. If you know there are duplicates to begin with, trying to make a unique index is just asking for errors.
| 4:56 pm on Mar 24, 2011 (gmt 0)|
1.) Add a non-unique index as suggested.
2.) MySQL counts are optimized for count(*) if I remember correctly (double check), but I think I remember reading the count of indexed data is optimized for a count(*) over a specific col. (I don't remember all the ins and outs, but I know for certain queries related to counts it is -or was- better to use the * than something more specific because of the way an index is created.)
3.) Chunk if necessary, as coopster suggests.
I've looped through text before and actually had a low timeout to work with, so I set the loop to where it would complete in slightly less than the timeout limit reliably, put it on an HTML page with a meta refresh that fired at slightly less than the timeout, opened it in a browser window and let it run.
I cut down manual refreshes from once every timeout to once every 30 minutes or so.
|brotherhood of LAN|
| 5:31 pm on Mar 24, 2011 (gmt 0)|
|2.) MySQL counts are optimized for count(*) if I remember correctly (double check), but I think I remember reading the count of indexed data is optimized for a count(*) over a specific col. (I don't remember all the ins and outs, but I know for certain queries related to counts it is -or was- better to use the * than something more specific because of the way an index is created.) |
MyISAM tables are optimized for COUNT(*) with no WHERE clause because they store the row count as meta data, the query appears instant because the query doesn't actually scan any indexes or data. Transactional engines like InnoDB are not 'optimized' as such for COUNT(*) because of their transactional nature, the table can be in more than one state at any one time.
Anyway, assuming it is a MyISAM table, any WHERE clause for COUNT() will cause an index or table scan so an index is preferable.
| 7:43 pm on Mar 24, 2011 (gmt 0)|
Thanks, been a long time since I've read all that stuff and it's filed away somewhere, so I only remember bits and pieces at times ... I always index anything I will be selecting by as habit, and really don't understand why people wouldn't? I guess at times you'll hit an index max size, but for the most part, imo, if you're using it in the 'where' you should find any way to index it ... Am I missing something?