|Selecting data from one table that's not in another|
Hi there Guru's,
I posted this same topic last year (almost to the date!) First try! [webmasterworld.com]
I've got an issue that's driving me mad, this code:-
Select serialnumber from
where id_number not in
(SELECT serialnumber FROM newbuild
Works fine on my mock table that's got about 20 rows in it (trialing the code first), then when I try it on my test server which compares the data of one table (serial, approx 6000 rows), to the other (newbuild, approx 10000 rows).
This takes in excess of 1 min. 50 secs to complete, and constantly returns ALL rows from the serial table.
I am stumped; the two rows being evaluated are both primary keys in their repective tables and are both ints.
I've checked the collation of each table and made sure that they are identical in all structure.
Why would it work on 20 rows with a 4 row check where 1 row isn't in the 20 row table!
Is there a more efficient way of checking for data present in on table that isn't in another?
Failing this; I will re design the database to be more fluent and less clunky! I've inherited this system from someone who wasn't that bothered about longevity of task.
Have you tried EXPLAIN?
Does using a JOIN as suggested in the previous thread make a difference?
Never use not in, it's terrible. Use an outer join and test for null instead.
Hi there Guy's,
After MUCH deliberation and the help of going through each row manualy myself I have figured out a potential reason for why this is so slow. Yes the code works well, does what it is meant to (ameliorating my last "returns all rows" comment, it returns exactly what it should, just my interpretation of data was fuddled!), but it seems that there are ALOT more records NOT IN than are in. I was expecting there to be IRO 50-100 records difference between the two tables, but it seems that the ratio is more like 5:1 which is why (I think) there is a huge delay whilst this list is compiled.
If this logic isn't correct, then I suspect that I need to change the query and seeing as "JOIN" is being suggested, I need to try this again..
@DaveWave: I couldn't get JOIN to work, so I gave up for other methods...
@Graemw_p: I'm not sure about explain, could you give me some context on that please? I'm knee deep in a C project at the moment and this has taken a back seat for a few days, but I can't help but still ponder it!
Again, thanks for the advice.
Select s.serialnumber from
Left outer Join newbuild n on n.serialnumber =s.id_number
where id_number is null
Not sure why you have serialnumber in table Serial and you are testing for id_number in table newbuilds, bit confusuing to me but I've just taken your query and converted it to an outer join.
In general EXPLAIN will show you exactly what a query does. For example it will show if a subquery is executing just once of repeatedly, whether it is being optimised to a join, etc.
Make sure your tables are indexed properly- there's no reason it should take so long with such small tables. I do similar joins on tables with millions of rows and get results in less than a few seconds (and most of the time is actually from displaying the results, not the actually pulling of data).
Are you on shared hosting? Things could be slow from some other processing hog(s).
EXPLAIN will also tell you if the indices are being used as expected.
Also, it may help at this point to share (the relevants bits of) your database schema.
Hi there Guys,
Thanks for all of your advice on this, I'll report back soon as it will make a huge difference to a project that just keeps getting larger and larger!
I'll check out explain too, looks like it might help.