
php mysql optimization

         

omoutop

9:31 am on Aug 31, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I have been assigned the task of creating some graphical stats for a website from some saved data.

Facts:
- There are 3 databases in use: dbCurrent, dbStats, dbBackup.
- dbCurrent is the main database of the website
- dbStats holds various tables of statistics and tracking data
- dbBackup holds the last five years' stats/tracking tables

The data I will use comes from two databases (dbStats, dbBackup).

The table names are stats2006, stats2007, stats2008, etc., except the current stats table, which is just "stats". Each table holds the data for its year.

The table structure for each year of data is the same:
- primaryID field is integer
- productID field is integer
- dateMonitor field is integer (unix timestamp)
- pageName field is varchar(20)
The productID, dateMonitor, and pageName fields are also indexed.

In other words: what product was viewed, on what date, and from what page.

So what I thought was to loop over each table and get my data. Each of my queries looks like:

SELECT COUNT(primaryID) AS myCounter FROM $tablename WHERE $conditions

where $tablename and $conditions are variables set in each loop iteration.
All conditions are similar to:
- dateMonitor between date1 and date2
- pageName='some val'
- productID IN ($comma_separated_values)
- a combination of the above (a fully assembled example is shown below)
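
For illustration, one fully assembled query might look like this (the timestamps, page name, and product IDs are placeholders):

SELECT COUNT(primaryID) AS myCounter
FROM dbStats.stats
WHERE dateMonitor BETWEEN 1293840000 AND 1296518399
AND pageName='some val'
AND productID IN (3,7,12)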

All of these are working decently so far (for a single product).

When I try to create a report comparing 'x' products across 'y' years (chosen dynamically by the admin/moderator), the script runs for more than 15 minutes.

I am looking for a way to improve the performance of the script. The logic/structure I use so far follows:

Loop through products to find the IDs to use (typical format: x,y,z comma-separated values)
Open loop through years/months
Execute one SQL query for each affected table/database to get the number of affected rows
Populate my arrays of data
Close year loop
Send data to the graph script (jQuery jqPlot, to be exact) to print on screen


I have used the jqPlot graph script many times without any problems so far.

Any help/ideas appreciated.

lostdreamer

10:41 am on Aug 31, 2011 (gmt 0)

10+ Year Member



The query " Select COUNT(primaryID) as myCounter FROM $tablename WHERE $conditions " should not be that tough... So I'm thinking you might be doing nested queries?


while ($outerRow = mysql_fetch_array($rstOuter)) {
    $query = "SELECT ... FROM ...";
    $rst = mysql_query($query); // one extra query per outer row
    while ($row = mysql_fetch_array($rst)) {
        // ... more code ...
    }
}


If so, you would be issuing a number of queries that grows multiplicatively with the number of products and dates you're selecting, and it would probably be a lot better to do a single query with the needed JOINs.
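
For example, instead of one COUNT query per product, a single grouped query can return all the per-product counts at once. A minimal sketch, assuming the table layout you described (timestamps and IDs are placeholders):

SELECT productID, COUNT(primaryID) AS myCounter
FROM dbStats.stats
WHERE dateMonitor BETWEEN 1293840000 AND 1296518399
AND pageName='some val'
AND productID IN (3,7,12)
GROUP BY productID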

Could you perhaps copy/paste the part of the code that does the actual queries?

omoutop

10:57 am on Aug 31, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



OK, here is part of the query code:


for ($m = 1; $m <= 12; $m++) // loop over the months
{
    $startDate   = mktime(0, 0, 0, $m, 1, date("Y"));                // first second of the month (0, not 1, so second-zero rows aren't missed)
    $daysOfMonth = date("t", mktime(10, 10, 10, $m, 10, date("Y"))); // number of days in this month
    $endDate     = mktime(23, 59, 59, $m, $daysOfMonth, date("Y"));  // last second of the month

    $query_chk1 = "SELECT COUNT(*) AS monthlyTotal FROM db1.table1 WHERE dateMonitor>='$startDate' AND dateMonitor<='$endDate' AND pageName='$myPage'";
    $result_chk1 = mysql_query($query_chk1) or die(mysql_error().'<p>'.$query_chk1.'</p>');
    $row1 = mysql_fetch_array($result_chk1);
    $monthlyValues[] = $row1['monthlyTotal']; // one COUNT per month
}


Of course, this part is inside two other loops:
- one loop over databases/tables
- one loop over years

The final arrays hold data for 5 years, 12 months, and 14 products per month.

omoutop

11:47 am on Aug 31, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The first speed improvement came from using UNION ALL:

SELECT COUNT(*) AS monthlyTotal FROM db1.table1 WHERE dateMonitor>='1167606001' AND dateMonitor<='1170284399' AND pageName='test'
UNION ALL
SELECT COUNT(*) AS monthlyTotal FROM db1.table2 WHERE dateMonitor>='1170284401' AND dateMonitor<='1172703599' AND pageName='test'
...


This dropped my execution time from 15-17 minutes to 11-13 minutes.

Any other ideas?

lostdreamer

10:11 am on Sep 1, 2011 (gmt 0)

10+ Year Member



Actually,

since most systems I've designed SELECT data more often than they INSERT it, I always code by the rule: SELECT should be faster than INSERT.

In other words: if your data is structured in a way that SELECTing from it takes too long, you probably have an error in the structure itself.

What you're getting now is: 5 x 12 x 14 x {number of databases/tables} = a huge number of queries for any report.

Even if each of those queries takes only a second, 5 x 12 x 14 = 840 of them per table already add up to well over ten minutes in total.

You are now looping and creating separate queries for every month...
Another thing you could try is this:

SELECT COUNT(primaryID) AS monthlyTotal, MONTH(FROM_UNIXTIME(dateMonitor)) AS month
FROM db1.table1
WHERE pageName='test'
GROUP BY MONTH(FROM_UNIXTIME(dateMonitor))


If this query takes too long, I suggest you alter the tables and add year/month fields alongside the unix timestamp.

That way you could run (much faster) queries like:

SELECT COUNT(primaryID) AS monthlyTotal, year, month
FROM db1.table1
WHERE pageName='test'
GROUP BY year, month


These queries would give you the totals for all months and years (if you put all the data in 1 table instead of stats_2003, stats_2004, etc.), and they would do it within seconds instead of minutes.
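
A minimal sketch of consuming such a grouped result in PHP, using the same mysql_* calls as earlier in the thread (the table and page names are placeholders):

$query = "SELECT COUNT(primaryID) AS monthlyTotal, MONTH(FROM_UNIXTIME(dateMonitor)) AS m
          FROM db1.table1 WHERE pageName='test' GROUP BY m";
$result = mysql_query($query) or die(mysql_error());
$monthlyValues = array_fill(1, 12, 0); // months with no rows stay at zero
while ($row = mysql_fetch_assoc($result)) {
    $monthlyValues[(int)$row['m']] = (int)$row['monthlyTotal'];
}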

Let me know if speed increases ;)

omoutop

1:17 pm on Sep 1, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Thanks for the insight... it will take some time to convert the data to these tables. There are a total of 130 million records to merge into one single table, so it will take a while :)

rocknbil

4:35 pm on Sep 1, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Cut out everything you can - this includes loops. Set up your front end so that only a reasonable range can be queried - 5 years of data will obviously bog things down; a few months, not so much.

Try to avoid storing results in arrays if you can - this consumes memory that could be used for your queries.

Loop through products to find the IDs to use (typical format: x,y,z comma-separated values)
Open loop through years/months
Execute one SQL query for each affected table/database to get the number of affected rows


Combine all of these into one SELECT statement somehow. Use the MySQL date functions instead of mktime.
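
For example (a sketch only - table and column names as in the earlier posts), MySQL can compute the month boundaries itself:

SELECT COUNT(primaryID) AS monthlyTotal
FROM db1.table1
WHERE dateMonitor >= UNIX_TIMESTAMP('2011-08-01 00:00:00')
AND dateMonitor < UNIX_TIMESTAMP('2011-09-01 00:00:00')
AND pageName='some val'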

As you go through your loop, either print to STDOUT as you go*, or store the results in a scalar and print it at the end of the loop. This consumes less memory and avoids looping through the data a second time.

*Normally I recommend against this, as it gives you the "half-printed page/wait/print some more" effect on large data sets, but sometimes it's the only way to speed things up.

Avoid * wherever possible. Select specific fields.

A generic example might be


$query = "select field1,field2,field3,field4 from $table";
// These should be SELECT lists, not user input
$start=$end=$compare=$where=null;
if (
isset($_POST['startmonth']) and ($_POST['startmonth'] > 0) and
isset($_POST['startday']) and ($_POST['startday'] > 0) and
isset($_POST['startyear']) and ($_POST['startyear'] > 0)
) { $start = "$_POST['startday']-$_POST['startmonth']-$_POST['startyear']; }
if (
isset($_POST['endmonth']) and ($_POST['endmonth'] > 0) and
isset($_POST['endday']) and ($_POST['endday'] > 0) and
isset($_POST['endyear']) and ($_POST['endyear'] > 0)
) { $end = "$_POST['endday']-$_POST['endmonth']-$_POST['endyear']; }
if ($start and $end) { $compare = " (dateMonitor >='$start' and dateMonitor <= '$end'"; }
else if ($start) { $compare = dateMonitor >= '$start'; }
else if ($end) { $compare = dateMonitor <= '$end'; }
if ($compare) { $where = " $comparator"; }
if (isset($_POST['pageName']) and ! empty($_POST['pageName'])) {
if ($where) { $where .= ' and'; }
$where .= " pageName='$_POST['pageName']'";
}
if (isset($_POST['productID']) and ! empty($_POST['productID'])) {
if ($where) { $where .= ' and'; }
// If single value, don't use IN - faster
if (preg_match('/,/',$_POST['productID'])) {
$where .= " productID in ('$_POST['productID']')";
}
else { $where .= " productID = '$_POST['productID']'";
}
//
if ($where) { $query .= " where $where"; }


Another thing that will REALLY help is adding a LIMIT clause and pagination to the scheme. Default it to something like 25 or 50 records if the user doesn't select some other page limit. This requires two selects; the count select uses the same WHERE conditions... just no LIMIT.

$query = "select field1,field2,field3,field4 from $table";
$count = "count(ID) from $table";
// same where conditions, append $where to $count
// Execute count query, store total records in $record_count
if ($record_count > 0) {
// append order and limit to $query, execute
// build pagination list based on $total $record_count
}
else {
echo "no records found";
// re-output search form here
}
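
A hypothetical sketch of the ORDER BY/LIMIT part (the variable and field names are made up):

$perPage = (isset($_POST['perpage']) and (int)$_POST['perpage'] > 0) ? (int)$_POST['perpage'] : 25;
$page    = (isset($_POST['page'])    and (int)$_POST['page']    > 0) ? (int)$_POST['page']    : 1;
$offset  = ($page - 1) * $perPage;
$query  .= " order by dateMonitor desc limit $offset, $perPage"; // only fetch one page of rows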

brotherhood of LAN

4:43 pm on Sep 1, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Since most systems I've designed SELECT data more often than they INSERT it, I always code by the rule: SELECT should be faster than INSERT.


Good advice IMO, particularly for web-facing apps with more visitors than data updates/insertions.

The table engine counts for a lot too, and different engines can yield very different performance. Tweaking the MySQL ini variables can also give a massive performance gain when you are able to edit them.
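
For example (illustrative values only - the right numbers depend on available RAM and the table engine in use):

[mysqld]
key_buffer_size         = 256M  # MyISAM index cache
innodb_buffer_pool_size = 1G    # InnoDB data + index cache
tmp_table_size          = 64M   # in-memory temp tables for GROUP BY
max_heap_table_size     = 64M   # the smaller of these two is the effective limit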

omoutop

6:04 am on Sep 2, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month




Loop through products to find the IDs to use (typical format: x,y,z comma-separated values)
Open loop through years/months
Execute one SQL query for each affected table/database to get the number of affected rows


As you can see in my third post here, I have already merged my queries into a single one using UNION ALL. Also, no * is used in the selects, only a COUNT(). No GROUP BY, no ORDER BY. Only 1-3 restrictions in the WHERE clause (product, date (month or year), and page).

Use the mySQL date commands instead of mktime.

The saved data are already stored as unix timestamps.

This is not a front-end application - only 2 people will have access to it. The purpose is to create some graphs comparing yearly/monthly performance. The script will be run manually every month.

I also have no access to any ini files.

lostdreamer

7:42 am on Sep 2, 2011 (gmt 0)

10+ Year Member



What might also be a valid option is to aggregate the old results...

I mean: you probably don't need to know (in your reports) that product X was viewed at 22:14:13 on August 14th, 2008.
What you probably do want to know is that product X was viewed 19 times on August 14, 2008 (or maybe even just that product X was viewed 214 times in August 2008).

So, when a year has been closed, you might want to think about aggregating all of its results to the depth you need (probably per day).

The aggregation would probably take about as much time as one report takes you now, but it could take you from 130,000,000 rows down to about 1-10,000,000 (and save just as much query time).

Something like the following should be able to aggregate one of your year tables into a new table for the reports (this version aggregates to monthly data):

INSERT INTO aggr_table (year, month, page, product, views)
SELECT
    YEAR(FROM_UNIXTIME(dateMonitor)) AS year,
    MONTH(FROM_UNIXTIME(dateMonitor)) AS month,
    pageName,
    productID,
    COUNT(primaryID)
FROM stats2006
GROUP BY year, month, pageName, productID

After creating the 'aggr_table' table and running this query for each of your year tables, you get one table with the aggregated results of all your data, which can be queried a lot faster (think seconds).
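
For completeness, a sketch of what the aggregate table and a report query against it might look like (column types are assumed; the product IDs and page name are placeholders):

CREATE TABLE aggr_table (
    year    SMALLINT UNSIGNED NOT NULL,
    month   TINYINT UNSIGNED NOT NULL,
    page    VARCHAR(20) NOT NULL,
    product INT NOT NULL,
    views   INT UNSIGNED NOT NULL,
    KEY idx_ym_prod (year, month, product)
);

-- compare x products across y years, per month
SELECT year, month, product, SUM(views) AS total
FROM aggr_table
WHERE product IN (3,7,12) AND page='some val'
GROUP BY year, month, product;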

omoutop

11:01 am on Sep 2, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Yes, that's exactly what I am building now, lostdreamer. We currently only need year/month data (not even per day).

So far, after converting 3 tables (out of 5), the script executes in 34 seconds!
I assume with all 5 tables it will go to near 1 minute, but that's acceptable (since the original attempt took 15+ minutes).

thank you.