Forum Moderators: coopster


Better idea welcome

         

wkpride

3:29 am on Feb 25, 2009 (gmt 0)

10+ Year Member



Hi,
the short script reads a record, stores its values, and then checks the entire database for a match (duplicate). After the check, it moves on to the next record. The average database has 50,000 records...
I'd like to hear thoughts on a better (faster) way to execute this. I'm using WAMP server on a new laptop... FYI.

This is the code...

while($row = mysql_fetch_array($result)) {
    $sysid = $row['sysid'];
    $sc = $row['scname'];
    $pole = $row['polenum'];

    $q2 = mysql_query("SELECT sysid, scname, polenum FROM kenpoles WHERE scname LIKE '%Morgantown%'");

    while($dow = mysql_fetch_array($q2)) {
        $id2 = $dow['sysid'];
        $sc2 = $dow['scname'];
        $p2 = $dow['polenum'];

        if (($pole == $p2) && ($sc == $sc2) && ($sysid != $id2)) {
            echo "<tr>";
            echo "<td>" . $row['sysid'] . "</td>";
            echo "<td>" . $row['scname'] . "</td>";
            echo "<td>" . $row['polenum'] . "</td>";
            echo "</tr>";
        }
    }
}

It works, but my PC takes hours to process, maybe it shouldn't even be trying? I'm hoping that someone spots a simpler logic test.

Thanks,

KP

blang

3:52 am on Feb 25, 2009 (gmt 0)

10+ Year Member



Holy crow, I hope your laptop is only a development box and not your server platform! Seriously, though, you can only really get an idea of how well your queries are going to perform on 'real' hardware.

Ok, so your basic issue is that you are performing queries within loops (shudder). That's a major efficiency killer (aside from the unnecessary variable reassignment vs. just using the array element, e.g. $row['sysid'], something else that can grind a script down if you have a lot of variables in the namespace).

What you need to be doing is performing a JOIN statement [dev.mysql.com]. You don't show your first query statement, but I would bet you can join #1 and #2 into a single query that will return what you want. I would first take a look at your table(s), make sure they're properly indexed [dev.mysql.com], and then look at a way to join the tables so that you get all the results you want.

Unless I'm not seeing the big picture, 90% of these problems can be solved with proper indexes and a crafty join statement.

There are a ton of good database / SQL / JOIN tutorials out there, take a look at your favorite search engine for some.

As an aside, something else that can affect performance is the table / engine type. You might be using InnoDB by default, where MyISAM would probably be a better choice for fast table scans.

rocknbil

4:21 am on Feb 25, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



This one catches my eye because I had almost the exact same project a few months back.

If you don't care to read the lengthy post about how I solved it, here's a quick one-liner you can try. :-) Change this

$q2 = mysql_query("SELECT sysid, scname, polenum FROM kenpoles WHERE scname LIKE '%Morgantown%'");

to this

$q2 = mysql_query("SELECT sysid, scname, polenum FROM kenpoles WHERE scname = 'Morgantown'");

LIKE works like a pattern match rather than a straight comparison, and a leading wildcard keeps it from using an index, so it's probably slowing things down a lot. If an exact match won't work for you (e.g. 'morgantown' vs. 'Morgantown'), see post #2. Or . . . .
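You can actually watch the index get used (or not). Here's a minimal sketch using SQLite's EXPLAIN QUERY PLAN; SQLite and the column layout are my assumptions for illustration, but MySQL's EXPLAIN shows the same difference:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kenpoles (sysid INTEGER, scname TEXT)")
conn.execute("CREATE INDEX idx_scname ON kenpoles (scname)")

# exact match: the planner can SEARCH using the index
plan_eq = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM kenpoles WHERE scname = 'Morgantown'"
).fetchall()

# leading-wildcard LIKE: the planner falls back to a full table SCAN
plan_like = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM kenpoles WHERE scname LIKE '%Morgantown%'"
).fetchall()

print(plan_eq[0][3])    # SEARCH ... USING INDEX idx_scname ...
print(plan_like[0][3])  # SCAN ...
```

The plan's detail column says SEARCH (indexed lookup) for the equality version and SCAN (every row examined) for the leading-wildcard LIKE, which is exactly the slowdown described above.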

-------------------

The task: a million-plus records in one list, over a million in a second list. Take list A, compare it to list B, mark any emails in list B that exist in list A, log all duplicates, and locate invalid email addresses. The client was doing this offline using MS programs and it was taking 15 hours.

This is a flat list, a single field, of email addresses in both lists.

My first thought was exactly as previously posted, just dump the data in a temp table and do some joins . . . . I do know how to index tables. This should be straightforward and easy.

Coded it up, fired it up . . . and my heart sank. I did some calculating, and at the rate it was working it looked like it would take 55 hours.

That's not gonna' work . . .

Although I developed on my understressed dedi, I had no idea of what environment this client would work in, which made matters more tenuous.

I didn't let go of the database idea. I tried different select methods, everything; the results were all over the map.

As it turned out, I got the best results, believe it or not, by dumping only the list-to-check in a database and reading the "comparator" list line by line, in manageable chunks.

This is completely counter to everything I'd learned about databases and flat files, and it was an eye-opener. I'd literally set up a program that let you select what to do: put both in a db and compare; put only the list-to-check in the DB and read the comparator line by line; vice versa; or leave them both as plain text.

For obvious reasons, you can't just slurp a million lines of text into memory and hope it doesn't get corrupted - it does. I created a setting to read the comparator list in chunks. 1000-line chunks seemed to work very fast.

1. Store the list-to-check in a database table. You can use a temp, but I used a permanent table for the ability to check records and verify data. So obviously, the first thing you do is delete all records before storing the list-to-check (and flush out the error and dupe tables, below.)

2. Pull off 1000 lines of the comparator list into an array, mark the "progress" through the comparator, and check each against the million+ rows in the table. If you find duplicates (or invalid syntax, or other problems that need attention), store them in a separate array.

3. When this 1K is done,
- dump the dupes in the duplicate table, the errors in the error table
- flush out your arrays
- begin reading from where you left off for another 1K
- Rinse and repeat until that lovable "complete" message is revealed.

It gets even better - I did this with a forked process, allowing the client to click and walk away. No hanging white screens, no wondering if it worked and crashing the server by submitting over and over.

Submit, fork the child to do the work, and return a "monitor" response page immediately to the parent. Within the response page (the parent process), a small iframe set to refresh (at whatever interval you choose) pulls a count every x seconds and reports where it is in the current list.

With the client's test parameters (1 million in the comparator, 1.5 million in the list-to-check, all uniques) I got it from 15 hours to 55 minutes.
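A minimal sketch of that chunked approach, in Python with SQLite standing in for the database (rocknbil's actual code isn't shown; the names, the validity check, and the chunk size here are assumptions):

```python
import sqlite3

def find_dupes(list_to_check, comparator_lines, chunk_size=1000):
    """Store the list-to-check in a table, then walk the comparator
    list in chunks, collecting duplicates and invalid addresses."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE list_to_check (email TEXT PRIMARY KEY)")
    conn.executemany("INSERT OR IGNORE INTO list_to_check VALUES (?)",
                     [(e,) for e in list_to_check])
    duplicates, errors = [], []
    for start in range(0, len(comparator_lines), chunk_size):
        for email in comparator_lines[start:start + chunk_size]:
            email = email.strip()
            if "@" not in email:      # crude stand-in for real validation
                errors.append(email)
            elif conn.execute("SELECT 1 FROM list_to_check WHERE email = ?",
                              (email,)).fetchone():
                duplicates.append(email)
        # end of chunk: this is where you'd dump dupes/errors to their
        # tables, flush the arrays, and record progress (step 3 above)
    conn.close()
    return duplicates, errors
```

The PRIMARY KEY on email gives each lookup an index, which is what keeps the per-line check fast even against a million-row table.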

whoisgregg

2:58 pm on Feb 25, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



wkpride, please post your first query that you use for the outer loop and I promise you we can give you a new single query that completes a thousand times faster.

The average database has 50,000 records

Shouldn't be a problem. That's not very big in the world of MySQL. :)

wkpride

12:33 am on Feb 26, 2009 (gmt 0)

10+ Year Member



Howdy,
Thanks for the replies! I did a quick change from '%morgantown%' to plain 'morgantown' and eliminated one unneeded variable load. I ran it on a smaller section of 17,000 records and it finished in less than an hour.
I've started reading about the JOIN statement & will need to study & fiddle with it before giving it a go.

One last item. I actually have 1.6m records to clean, but decided on 50k per shot.

1,000 times faster huh? I'm likin' that!

Here are both loops:

$result = mysql_query("SELECT sysid, polenum FROM kenpoles WHERE scname LIKE 'Morgantown'");

while($row = mysql_fetch_array($result)) {
    $sysid = $row['sysid'];
    $pole = $row['polenum'];

    $q2 = mysql_query("SELECT sysid, polenum FROM kenpoles WHERE scname LIKE 'Morgantown'");

    while($dow = mysql_fetch_array($q2)) {
        $id2 = $dow['sysid'];
        $p2 = $dow['polenum'];

        if (($pole == $p2) && ($sysid != $id2)) {
            echo "<tr>";
            echo "<td>" . $row['sysid'] . "</td>";
            echo "<td>" . $row['polenum'] . "</td>";
            echo "</tr>";
        }
    }
}

Pretty straightforward. The outer loop loads each record in turn; the inner loop checks every record against it.

Thanks for taking a look at it...

KP

blang

12:46 am on Feb 26, 2009 (gmt 0)

10+ Year Member



Huh? You're performing the exact same query twice. What is the purpose of that? What exactly are you trying to do? There may be an even simpler SELECT statement without having to perform a JOIN.

BTW, if you're going to use 'Morgantown' (no wildcard character '%'), then omit the LIKE and just use =.

wkpride

1:10 am on Feb 26, 2009 (gmt 0)

10+ Year Member



There are an unknown number of duplicate $polenum(s). There could be 5 duplicates of the same $polenum. There could be 2...

$polenum could be = 127744 on line #100 and $polenum=1x3209 on line #263 and $polenum=127744 on line #492... $polenum=127744 on line #499 and $polenum = 1x3209 on line #823.

These are errors, as $polenum should not be repeated (each should represent a single record in the table).

So, I check a record against every other record in the same table. Hope that makes sense. KP

blang

1:29 am on Feb 26, 2009 (gmt 0)

10+ Year Member



Ah ok. Is there some reason you don't have a UNIQUE constraint on the `polenum` field?

Oops - query example removed as it was not correct; I'll post a valid solution.

blang

3:02 am on Feb 26, 2009 (gmt 0)

10+ Year Member



Ok, my bad. I left out a couple of crucial components.

Try this self join:


SELECT
a.sysid AS first, b.sysid AS second, a.polenum
FROM kenpoles AS a
INNER JOIN kenpoles AS b
ON ( (a.polenum = b.polenum) AND (a.sysid <> b.sysid) )
WHERE scname = 'Morgantown'
GROUP BY a.polenum

I left off the GROUP BY and the conditional on the `sysid` field.

Second edit to clean it up.
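For comparison, if you only need the offending polenum values rather than the sysid pairs, an aggregate-only version does the same check. Here's a sketch run against SQLite with made-up sample rows (SQLite and the data are my assumptions for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kenpoles (sysid INTEGER, scname TEXT, polenum TEXT)")
conn.executemany("INSERT INTO kenpoles VALUES (?, ?, ?)", [
    (1, "Morgantown", "127744"),
    (2, "Morgantown", "1x3209"),
    (3, "Morgantown", "127744"),   # duplicate of sysid 1
])

# one pass over the table instead of a per-record rescan
dupes = conn.execute("""
    SELECT polenum, COUNT(*) AS n
    FROM kenpoles
    WHERE scname = 'Morgantown'
    GROUP BY polenum
    HAVING COUNT(*) > 1
""").fetchall()
print(dupes)   # [('127744', 2)]
```

GROUP BY ... HAVING COUNT(*) > 1 is the classic "find the duplicates" idiom and works the same way in MySQL.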

whoisgregg

2:21 pm on Feb 26, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I can't wait to hear how much faster blang's query completes. :)

Added: So when you say you have 1.6m records but are only processing 50K at a time, does that mean your table has 1.6m rows and approx 50K of those have scname = 'Morgantown'? Or are you limiting your selection to 50K through some other approach?

wkpride

10:12 pm on Feb 26, 2009 (gmt 0)

10+ Year Member



I'll let you guys know how fast it is. I want to study up on JOIN & hope to run it this weekend. Yep, 1.6m rows divided up by scname(s). That seemed like the best way to process the query (time-wise)... You suppose using the JOIN will let me run the entire db (1.6m rows)? Remember, I'm using a 4G laptop...

Oh, one more thing.
From what I've read (not much yet) it looks like I'll have to have a primary or unique key on one of the tables. Do both tables use/need a primary/unique key? Thanks for the help... Most appreciated & I'll let you know the results!

KP

whoisgregg

2:36 am on Feb 27, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



As I understand it, blang's query will actually perform the JOIN on the entire table first, then discard the results that don't match the WHERE statement. So, technically, your only CPU savings by restricting by scname is not having to return as many rows to your PHP script.

Regarding PRIMARY and UNIQUE, you should be using one or the other in *most* tables. In this situation, having a UNIQUE on polenum would have prevented duplicate entries from being created in the first place.

You'll definitely want an index of some type on each of the fields involved in this query: polenum, sysid, and scname. If you don't already have them, that alone is going to speed up even your original code by a huge amount.
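As a sketch of those two points (SQLite syntax for illustration; MySQL's is nearly identical): a UNIQUE constraint on polenum stops duplicates at insert time, and a plain index covers the scname lookups.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE kenpoles (
    sysid   INTEGER PRIMARY KEY,
    scname  TEXT,
    polenum TEXT UNIQUE          -- duplicates rejected at insert time
)""")
conn.execute("CREATE INDEX idx_scname ON kenpoles (scname)")

conn.execute("INSERT INTO kenpoles (scname, polenum) VALUES ('Morgantown', '127744')")
try:
    # same polenum again: the UNIQUE constraint refuses it
    conn.execute("INSERT INTO kenpoles (scname, polenum) VALUES ('Morgantown', '127744')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)   # True
```

With the constraint in place there's nothing to clean up afterward; the bad row never makes it into the table.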

wkpride

3:13 pm on Feb 28, 2009 (gmt 0)

10+ Year Member



SELECT
a.sysid AS first, b.sysid AS second, a.polenum
FROM kenpoles AS a
INNER JOIN kenpoles AS b
ON ( (a.polenum = b.polenum) AND (a.sysid <> b.sysid) )
WHERE scname = 'Morgantown'
GROUP BY a.polenum;

Let me make sure I understand what you have:
a new table named a = kenpoles
a new table named b = kenpoles
a new table/column named a.sysid = first
a new table/column named b.sysid = second

That's losing me. I thought AS populated a new table... If so, "a.sysid AS first" means "first" becomes a new table. But later, (a.sysid<>b.sysid) is used, rather than (first<>second).

I ran it at the SQL prompt and rec'd this error:
ERROR 1052 (23000): Column 'scname' in where clause is ambiguous

So I changed WHERE scname = 'Morgantown' to WHERE (scname='Morgantown')

Ran it again & rec'd this error:
ERROR 1054 (42S22): Unknown column 'a' in 'field list'

Am I missing something obvious? Thanks, KP

blang

3:30 pm on Feb 28, 2009 (gmt 0)

10+ Year Member



No, I was missing something obvious. Sorry about that, yet another small thing I overlooked in the query.

It should be:

WHERE a.scname = 'Morgantown'
The column is ambiguous because it exists in both tables in the query.

This is simply a self-join. You join the table to itself using aliases. The AS keyword just assigns an alias that you can use to refer to a column or table. So `first` and `second` are just names for the same column, `sysid`; you obviously don't want two columns named `sysid` in the result set - how would you tell them apart? Aliases clear that up. In addition, the aliases `a` and `b` are what allow you to join the table to itself.
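To make the self-join concrete, here's the corrected query run end to end against SQLite with made-up sample rows (the data is an assumption; only the a.scname qualification differs from the query posted earlier):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kenpoles (sysid INTEGER, scname TEXT, polenum TEXT)")
conn.executemany("INSERT INTO kenpoles VALUES (?, ?, ?)", [
    (1, "Morgantown", "127744"),
    (2, "Morgantown", "1x3209"),
    (3, "Morgantown", "127744"),   # same polenum as sysid 1
])

result = conn.execute("""
    SELECT a.sysid AS first, b.sysid AS second, a.polenum
    FROM kenpoles AS a
    INNER JOIN kenpoles AS b
      ON a.polenum = b.polenum AND a.sysid <> b.sysid
    WHERE a.scname = 'Morgantown'
    GROUP BY a.polenum
""").fetchall()
print(result)   # one row for the duplicated polenum 127744
```

The GROUP BY collapses the mirror-image pairs (1,3) and (3,1) into a single row per duplicated polenum, which is why the result lists each problem pole only once.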

Another good use of aliases is on long table names that you need to refer to several times in the query, and to reference data returned from a function. Here's an example:


SELECT
table1.column1
, DATE_FORMAT(table2.date_field, '%b %D') AS formatted_date
FROM this_table_has_a_long_name AS table1
INNER JOIN yet_another_long_table_name AS table2
USING( IDcolumn )
ORDER BY table1.column1
;