Forum Moderators: open

Message Too Old, No Replies

Big table needs faster lookup

         

Dabrowski

4:48 pm on Aug 20, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



ok, I have solid background in programming, but am new to SQL and have just started playing with it. I've got it all working ok the way I want it for now, I am using Perl so I'm not trying to write huge SQL queries or anything.

I'm running MySQL 5.something, and I've got a table, well, 2 tables that form a web stats system.

The first table (stats_hit) logs when the user first hits the site, the second table (stats_hit_adv) logs for each page as they're moving through the site.

The problem is that after some time the tables are rather large, a count(*) reckons that stats_hit has 67,000 rows and stats_hit_adv has 341,000!

That in itself is not the problem, the problem is that with that much data it takes so long to lookup the results that the CGI page times out before it's done.

I know you're supposed to use primary keys and indexes to speed things up, which currently I have neither. I know how to create the above, but the MySQL help pages are a little sketchy on the details like, what are they, and how to use them.

Advice much appreciated.

SteveLetwin

5:27 pm on Aug 20, 2007 (gmt 0)

10+ Year Member



What's the structure of the tables, and possibly more important, what are the queries you're using? Are you just dumping the table out wholesale, or are you whittling the table down to only a few dozen (or thousand) rows?

Dabrowski

6:06 pm on Aug 20, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Here's the table definitions......


mysql> describe stats_hit;
+-----------+------------------+------+-----+---------+-------+
¦ Field ¦ Type ¦ Null ¦ Key ¦ Default ¦ Extra ¦
+-----------+------------------+------+-----+---------+-------+
¦ cookie ¦ int(10) unsigned ¦ YES ¦ ¦ NULL ¦ ¦
¦ url ¦ tinytext ¦ YES ¦ ¦ NULL ¦ ¦
¦ page ¦ tinytext ¦ YES ¦ ¦ NULL ¦ ¦
¦ referrer ¦ text ¦ YES ¦ ¦ NULL ¦ ¦
¦ timestamp ¦ int(10) unsigned ¦ YES ¦ ¦ NULL ¦ ¦
+-----------+------------------+------+-----+---------+-------+
5 rows in set (0.00 sec)

mysql> describe stats_hit_adv;
+-----------+------------------+------+-----+---------+-------+
¦ Field ¦ Type ¦ Null ¦ Key ¦ Default ¦ Extra ¦
+-----------+------------------+------+-----+---------+-------+
¦ cookie ¦ int(10) unsigned ¦ YES ¦ ¦ NULL ¦ ¦
¦ url ¦ tinytext ¦ YES ¦ ¦ NULL ¦ ¦
¦ page ¦ tinytext ¦ YES ¦ ¦ NULL ¦ ¦
¦ timestamp ¦ int(10) unsigned ¦ YES ¦ ¦ NULL ¦ ¦
+-----------+------------------+------+-----+---------+-------+
4 rows in set (0.00 sec)

And the query........

SELECT * FROM stats_hit WHERE url='www.mydomain.com' AND timestamp >= 1234567890 AND timestamp <= 1234567890;
SELECT * FROM stats_hit WHERE url='www.mydomain.com' AND timestamp >= 1234567890 AND timestamp <= 1234567890;

This query is just the visitor hits report, the 2 are not linked at this point. This shows the number of visitors vs. number of pages viewed.

Timestamp is the number returned by Perl's time() function (number of seconds since 01/01/80). This range will vary depending if they want days/weeks/months etc....

URL will be changed in the future to a numeric ID rather than a text string.

SteveLetwin

3:35 am on Aug 21, 2007 (gmt 0)

10+ Year Member



My understanding is that you just need to add indexes to both timestamp and url. Currently the whole table needs to be read and the results filtered out. But if they're indexed then only a portion of the table needs to be searched for further filtering.

If you analyze the queries then somewhere it should list whether or not a full table scan is being done.

LifeinAsia

3:54 pm on Aug 21, 2007 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Changing URL to numeric should definitely help, as will adding the indexes previously mentioned. Remember- indexes are your friends!

Our logs run into tens of millions of records and queries would probably kill the server if we didn't have indexes.

[edited by: LifeinAsia at 3:56 pm (utc) on Aug. 21, 2007]

Dabrowski

4:05 pm on Aug 21, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



ok, read up a bit on indexes...

ALTER TABLE stats_hit ADD INDEX( url, timestamp);
ALTER TABLE stats_hit_adv ADD INDEX( url, timestamp);

Will add joint indexes, does MySQL have a built in benchmark so I can see the differences?

SteveLetwin

6:12 pm on Aug 21, 2007 (gmt 0)

10+ Year Member



There's no built in benchmark, but there is explain [dev.mysql.com].

(Ignore the underscores, they're a formating kludge (and maybe even the wrong kludge to use))


mysql> explain select * from animals where name='sue';
+----+-------------+---------+------+---------------+------+---------+------+------+-------------+
¦ id ¦_select type ¦_table _ ¦ type ¦_possible keys ¦_key _¦ key len ¦_ref _¦ rows ¦_Extra _____ ¦
+----+-------------+---------+------+---------------+------+---------+------+------+-------------+
¦ _1 ¦_SIMPLE ____ ¦ animals ¦_ALL _¦ NULL ________ ¦ NULL ¦ __ NULL ¦_NULL ¦ __ 4 ¦_Using where ¦
+----+-------------+---------+------+---------------+------+---------+------+------+-------------+
1 row in set (0.00 sec)

There's no indexes on the table, so notice under type it's ALL (the whole table must be read). And under rows it says 4 (the number of rows "MySQL believes it must examine to execute the query.")

Then I add an index:


mysql> alter table animals add index (name);
Query OK, 4 rows affected (0.03 sec)
Records: 4 _Duplicates: 0 _Warnings: 0
_
mysql> explain select * from animals where name='sue';
+----+-------------+---------+------+---------------+------+---------+-------+------+--------------------------+
¦ id ¦_select type ¦_table _ ¦ type ¦_possible keys ¦_key _¦ key len ¦_ref _ ¦ rows ¦_Extra __________________ ¦
+----+-------------+---------+------+---------------+------+---------+-------+------+--------------------------+
¦ _1 ¦_SIMPLE ____ ¦ animals ¦_ref _¦ name ________ ¦ name ¦ ____ 21 ¦_const ¦ __ 1 ¦_Using where; Using index ¦
+----+-------------+---------+------+---------------+------+---------+-------+------+--------------------------+
1 row in set (0.00 sec)

Now the type is ref which is not a full tables scan which is good. (It means other stuff too, but I'm not sure what.) The rows went down too, and it says it's using the index under Extra.

You'll probably want to just read the manual for explain, since my understanding of it is cursory at best.

Dabrowski

9:42 pm on Aug 22, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



ok, here's what I've done......

The first explain comes up with this, as you say, searching 'ALL' and 329,000 rows.

+----+-------------+---------------+------+---------------+------+---------+------+--------+-------------+
¦ id ¦ select_type ¦ table ¦ type ¦ possible_keys ¦ key ¦ key_len ¦ ref ¦ rows ¦ Extra ¦
+----+-------------+---------------+------+---------------+------+---------+------+--------+-------------+
¦ 1 ¦ SIMPLE ¦ stats_hit_adv ¦ ALL ¦ NULL ¦ NULL ¦ NULL ¦ NULL ¦ 329417 ¦ Using where ¦
+----+-------------+---------------+------+---------------+------+---------+------+--------+-------------+

I did a few test queries, and with timestamp set to a value that returns no data, it took an average of 3 seconds per query. Now I realise a mistake with my Perl code, the data is arranged on a day by day basis, so I'm doing a query per day, multiplying this 3 seconds +data to however many days. I'll change it to do one lookup and sort it out offline.

I then added an index, as above but with url(20) instead as it complained. Definately need to change that to an int unsigned.

Anyways, the new explain looks like this.......

+----+-------------+---------------+-------+---------------+------+---------+------+-------+-------------+
¦ id ¦ select_type ¦ table ¦ type ¦ possible_keys ¦ key ¦ key_len ¦ ref ¦ rows ¦ Extra ¦
+----+-------------+---------------+-------+---------------+------+---------+------+-------+-------------+
¦ 1 ¦ SIMPLE ¦ stats_hit_adv ¦ range ¦ url ¦ url ¦ 28 ¦ NULL ¦ 20972 ¦ Using where ¦
+----+-------------+---------------+-------+---------------+------+---------+------+-------+-------------+

It appears to now be using url as a key, and only 20,000 rows, but under Extra it's not saying about using the index?

I did a few queries again to test, first was 0.5 sec, then 0.13 sec, I assume it has cached the results this time, but almost a 30x improvement!

Any ideas why it's not using the index?

Dabrowski

9:49 pm on Aug 22, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



oooo I just tried the reporting frontend - it's working again, no CGI timeout!

SteveLetwin

11:00 pm on Aug 22, 2007 (gmt 0)

10+ Year Member



It is using the index. From the EXPLAIN docs regarding the meaning of the type "range":
Only rows that are in a given range are retrieved, using an index to select the rows.

And under key it lists url as the index being used.

My earlier comments about the Extra column were misleading if not outright wrong. "Using index" means that only info in the index is being read, that no actual rows must be read. So if you did

SELECT URL FROM . . .
then it'd say "using index", but that wouldn't be of much use. In the example table I used earlier, there was only 1 column, hence the confusing results.

[edited by: SteveLetwin at 11:03 pm (utc) on Aug. 22, 2007]

Dabrowski

9:18 pm on Aug 23, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I see.

In that case, job done! I'll optimise the Perl code when I get a sec and I should be back to a reasonable loading time!

Let me know if you want any help with a Perl based advanced web stats system!

Thanks again for the help!

Chris.