Forum Moderators: coopster
This is the code...
while ($row = mysql_fetch_array($result)) {
    $sysid = $row['sysid'];
    $sc    = $row['scname'];
    $pole  = $row['polenum'];
    $q2 = mysql_query("SELECT sysid, scname, polenum FROM kenpoles WHERE scname LIKE '%Morgantown%'");
    while ($dow = mysql_fetch_array($q2)) {
        $id2 = $dow['sysid'];
        $sc2 = $dow['scname'];
        $p2  = $dow['polenum'];
        if ($pole == $p2 && $sc == $sc2 && $sysid != $id2) {
            echo "<tr>";
            echo "<td>" . $row['sysid'] . "</td>";
            echo "<td>" . $row['scname'] . "</td>";
            echo "<td>" . $row['polenum'] . "</td>";
            echo "</tr>";
        }
    }
}
It works, but it takes my PC hours to process; maybe it shouldn't even be trying this approach. I'm hoping someone spots a simpler logic test.
Thanks,
KP
Ok, so your basic issue is that you are performing queries within loops (shudder). That's a major efficiency killer. (Aside from that, unnecessary variable reassignment vs. just using the array element directly, e.g. $row['sysid'], is something else that can grind a script down if you have a lot of variables in the namespace.)
What you need to be doing is performing a JOIN statement [dev.mysql.com]. You don't show your first query statement, but I would bet you can join #1 and #2 into a single query that will return what you want. I would first take a look at your table(s), make sure they're properly indexed [dev.mysql.com], and then look at a way to join the tables so that you get all the results you want.
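You don't have to guess whether a query can use an index; EXPLAIN will tell you. A quick sketch against the table from this thread:

```sql
-- Shows which indexes (if any) MySQL will use for this query.
EXPLAIN SELECT sysid, scname, polenum
FROM kenpoles
WHERE scname LIKE '%Morgantown%';
-- A leading '%' wildcard forces a full table scan (type: ALL).

-- Lists the indexes that currently exist on the table:
SHOW INDEX FROM kenpoles;
```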
Unless I'm not seeing the big picture, 90% of these problems can be solved with proper indexes and a crafty join statement.
There are a ton of good database / SQL / JOIN tutorials out there; take a look at your favorite search engine for some.
As an aside, something else that can affect performance is the table / engine type. You might be using InnoDB by default, where MyISAM would probably be a better choice for fast table scans.
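If you want to check or switch, here's a quick sketch (the table name is from this thread; back up your data before changing engines):

```sql
-- The Engine column shows the current storage engine.
SHOW TABLE STATUS LIKE 'kenpoles';

-- Rebuilds the table with the new engine.
ALTER TABLE kenpoles ENGINE = MyISAM;
```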
If you don't care to read the lengthy post on how I solved it, here's a quick one-liner you can try. :-) Change this
$q2 = mysql_query("SELECT sysid, scname, polenum FROM kenpoles WHERE scname LIKE '%Morgantown%'");
to this
$q2 = mysql_query("SELECT sysid, scname, polenum FROM kenpoles WHERE scname = 'Morgantown'");
LIKE does wildcard pattern matching, and a leading '%' wildcard prevents MySQL from using an index, which is probably slowing it down a lot. If an exact match won't work for you (e.g. 'Morgantown' is only part of the stored value), see post #2. Or . . . .
-------------------
The task: a million-plus records in one list, over a million in a second list. Take list A, compare it to list B, mark any emails in list B that also exist in list A, log all duplicates, and locate invalid email addresses. The client was doing this offline using MS programs and it was taking 15 hours.
This is a flat list, a single field, of email addresses in both lists.
My first thought was exactly as previously posted, just dump the data in a temp table and do some joins . . . . I do know how to index tables. This should be straightforward and easy.
Coded it up, fired it up . . . and my heart sank. I did some calculating, and at the rate it was working it looked like it would take 55 hours.
That's not gonna' work . . .
Although I developed on my understressed dedi, I had no idea of what environment this client would work in, which made matters more tenuous.
I didn't let go of the database idea. I tried different select methods, everything; the results were all over the map.
As it turned out, I got the best results, believe it or not, by dumping only the list-to-check in a database and reading the "comparator" list line by line, in manageable chunks.
This is completely counter to everything I'd learned about databases and flat files, and it was an eye-opener. I'd literally set up a program that allowed you to select what to do: put both in a DB and compare; put only the list-to-check in the DB and read the comparator line by line; vice versa; or leave them both as plain text.
For obvious reasons, you can't just slurp a million lines of text into memory and hope it doesn't get corrupted - it does. I created a setting to read the comparator list in chunks; 1000-line chunks seemed to work very fast.
1. Store the list-to-check in a database table. You can use a temp, but I used a permanent table for the ability to check records and verify data. So obviously, the first thing you do is delete all records before storing the list-to-check (and flush out the error and dupe tables, below.)
2. Pull off 1000 lines of the comparator list into an array, mark your "progress" through the comparator, and check each against the million+ row table. If you find duplicates (or invalid syntax, or other problems that need attention), store them in a separate array.
3. When this 1K is done,
- dump the dupes in the duplicate table, the errors in the error table
- flush out your arrays
- begin reading from where you left off for another 1K
- Rinse and repeat until that lovable "complete" message is revealed.
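As a rough sketch of the steps above (this is not the poster's actual code; the file name, table name, and helper function are assumptions), the chunked loop might look like:

```php
<?php
// Sketch only: 'comparator.txt' and table 'list_to_check' are assumed names.
$dupes  = array();
$errors = array();
$chunk  = array();

$fh = fopen('comparator.txt', 'r');
while (($line = fgets($fh)) !== false) {
    $email = trim($line);
    // Basic syntax check; anything invalid goes to the error pile.
    if (!preg_match('/^[^@\s]+@[^@\s]+\.[^@\s]+$/', $email)) {
        $errors[] = $email;
        continue;
    }
    $chunk[] = $email;
    if (count($chunk) == 1000) {          // process in 1000-line chunks
        check_chunk($chunk, $dupes);
        $chunk = array();                 // flush the array
    }
}
if ($chunk) check_chunk($chunk, $dupes);  // final partial chunk
fclose($fh);

function check_chunk($chunk, &$dupes) {
    // One IN() query per chunk instead of one query per line.
    $list = "'" . implode("','", array_map('mysql_real_escape_string', $chunk)) . "'";
    $res  = mysql_query("SELECT email FROM list_to_check WHERE email IN ($list)");
    while ($row = mysql_fetch_array($res)) {
        $dupes[] = $row['email'];         // found in both lists
    }
    // ...INSERT the dupes into the duplicate table here, then flush...
}
```

The key point is that each chunk costs one round trip to the database rather than a thousand.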
It gets even better - I did this with a forked process, allowing the client to click and walk away. No hanging white screens, no wondering if it worked and crashing the server by submitting over and over.
Submit, fork the child to do the work, and return a "monitor" response page immediately to the parent. Within the response page (parent process), a small iframe set to refresh (at whatever interval you choose) pulls a count every x seconds and reports where it is in the current list.
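A minimal sketch of that fork-and-monitor pattern, assuming the pcntl extension is available (it usually isn't under mod_php; a common workaround is launching a background CLI worker with exec() instead). run_comparison() and progress.php are hypothetical names:

```php
<?php
// Sketch only; assumes pcntl is loaded (typically CLI or CGI, not mod_php).
$pid = pcntl_fork();
if ($pid == -1) {
    die('Could not fork');
} elseif ($pid) {
    // Parent: respond immediately with the monitor page.
    // The iframe polls progress.php, which SELECTs a count of processed rows.
    echo "<p>Job started.</p><iframe src='progress.php'></iframe>";
} else {
    // Child: do the long-running comparison work, then exit quietly.
    run_comparison();
    exit(0);
}
```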
With the client's test parameters (1 million in the comparator, 1.5 million in the list-to-check, all uniques) I got it from 15 hours to 55 minutes.
One last item. I actually have 1.6m records to clean, but decided on 50k per shot.
1,000 times faster huh? I'm likin' that!
Here are both loops:
$result = mysql_query("SELECT sysid, polenum FROM kenpoles WHERE scname LIKE 'Morgantown'");
while ($row = mysql_fetch_array($result)) {
    $sysid = $row['sysid'];
    $pole  = $row['polenum'];
    $q2 = mysql_query("SELECT sysid, polenum FROM kenpoles WHERE scname LIKE 'Morgantown'");
    while ($dow = mysql_fetch_array($q2)) {
        $id2 = $dow['sysid'];
        $p2  = $dow['polenum'];
        if ($pole == $p2 && $sysid != $id2) {
            echo "<tr>";
            echo "<td>" . $row['sysid'] . "</td>";
            echo "<td>" . $row['polenum'] . "</td>";
            echo "</tr>";
        }
    }
}
Pretty straightforward. The outside loop loads the first record; the inner loop checks every record against it.
Thanks for taking a look at it...
KP
$polenum could be = 127744 on line #100 and $polenum=1x3209 on line #263 and $polenum=127744 on line #492... $polenum=127744 on line #499 and $polenum = 1x3209 on line #823.
These are errors, as $polenum should not be repeated (each value should represent a single record in the table).
So, I check a record against every other record in the same table. Hope that makes sense. KP
Try this self join:
SELECT
a.sysid AS first, b.sysid AS second, a.polenum
FROM kenpoles AS a
INNER JOIN kenpoles AS b
ON ( (a.polenum = b.polenum) AND (a.sysid <> b.sysid) )
WHERE scname = 'Morgantown'
GROUP BY a.polenum
Edit: I had initially left off the GROUP BY and the conditional on the `sysid` field; they're included above now.
Second edit to clean it up.
Added: So when you say you have 1.6m records but are only processing 50K at a time, does that mean your table has 1.6m rows and approx 50K of those have scname = 'Morgantown'? Or are you limiting your selection to 50K through some other approach?
Oh, one more thing.
From what I've read (not much yet), it looks like I'll need a primary or unique key on one of the tables. Do both tables use/need a primary/unique? Thanks for the help... Most appreciated & I'll let you know the results!
KP
Regarding PRIMARY and UNIQUE, you should be using one or the other in *most* tables. In this situation, having a UNIQUE on polenum would have prevented duplicate entries from being created in the first place.
You'll definitely want an index of some type on each of the fields involved with this query: polenum, sysid, and scname. If you don't have them already, that alone is going to speed up even your original code by a huge amount.
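For example (the index names here are made up; the columns are from this thread):

```sql
ALTER TABLE kenpoles ADD INDEX idx_scname (scname);
ALTER TABLE kenpoles ADD INDEX idx_polenum (polenum);

-- A UNIQUE key would have prevented the duplicate polenum entries
-- in the first place, but it will fail until the existing dupes
-- are cleaned out:
-- ALTER TABLE kenpoles ADD UNIQUE KEY uq_polenum (polenum);
```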
Let me make sure I understand what you have:
a new table named a = kenpoles
a new table named b = kenpoles
a new table/column named a.sysid = first
a new table/column named b.sysid = second
That's losing me. I thought AS populates a new table... If so, "a.sysid AS first" & "first" becomes a new table. But later, (a.sysid<>b.sysid) is used, rather than (first<>second).
I ran it at the sql prompt and rec'd this error:
error 1052(23000) column 'scname' in where clause is ambiguous.
So I changed WHERE scname = 'Morgantown' to WHERE (scname='Morgantown')
Ran it again & rec'd this error:
error 1054(42s22) unknown column 'a' in 'field list'
Am I missing something obvious? Thanks, KP
It should be:
WHERE a.scname = 'Morgantown'
The column is ambiguous because it exists in both tables in the query.
This is simply a self-join statement: you join the table to itself using an alias. The AS keyword just assigns an alias that you can use to refer to a column or table. So `first` and `second` are just names for referencing the same column, `sysid` - you obviously don't want two columns named `sysid` in the resultset; how would you refer to them? An alias clears this up. In addition, the aliases `a` and `b` are what allow you to join the table to itself.
Another good use of aliases is on long table names that you need to refer to several times in the query, and to reference data returned from a function. Here's an example:
SELECT
table1.column1
, DATE_FORMAT(table2.date_field, '%b %D') AS formatted_date
FROM this_table_has_a_long_name AS table1
INNER JOIN yet_another_long_table_name AS table2
USING( IDcolumn )
ORDER BY table1.column1
;