Forum Moderators: open
I'm running MySQL 5.something, and I've got a table, well, 2 tables that form a web stats system.
The first table (stats_hit) logs when the user first hits the site, the second table (stats_hit_adv) logs for each page as they're moving through the site.
The problem is that after some time the tables are rather large, a count(*) reckons that stats_hit has 67,000 rows and stats_hit_adv has 341,000!
That in itself is not the problem, the problem is that with that much data it takes so long to lookup the results that the CGI page times out before it's done.
I know you're supposed to use primary keys and indexes to speed things up, which currently I have neither. I know how to create the above, but the MySQL help pages are a little sketchy on the details like, what are they, and how to use them.
Advice much appreciated.
mysql> describe stats_hit;
+-----------+------------------+------+-----+---------+-------+
¦ Field ¦ Type ¦ Null ¦ Key ¦ Default ¦ Extra ¦
+-----------+------------------+------+-----+---------+-------+
¦ cookie ¦ int(10) unsigned ¦ YES ¦ ¦ NULL ¦ ¦
¦ url ¦ tinytext ¦ YES ¦ ¦ NULL ¦ ¦
¦ page ¦ tinytext ¦ YES ¦ ¦ NULL ¦ ¦
¦ referrer ¦ text ¦ YES ¦ ¦ NULL ¦ ¦
¦ timestamp ¦ int(10) unsigned ¦ YES ¦ ¦ NULL ¦ ¦
+-----------+------------------+------+-----+---------+-------+
5 rows in set (0.00 sec)
mysql> describe stats_hit_adv;
+-----------+------------------+------+-----+---------+-------+
¦ Field ¦ Type ¦ Null ¦ Key ¦ Default ¦ Extra ¦
+-----------+------------------+------+-----+---------+-------+
¦ cookie ¦ int(10) unsigned ¦ YES ¦ ¦ NULL ¦ ¦
¦ url ¦ tinytext ¦ YES ¦ ¦ NULL ¦ ¦
¦ page ¦ tinytext ¦ YES ¦ ¦ NULL ¦ ¦
¦ timestamp ¦ int(10) unsigned ¦ YES ¦ ¦ NULL ¦ ¦
+-----------+------------------+------+-----+---------+-------+
4 rows in set (0.00 sec)
And the query........
SELECT * FROM stats_hit WHERE url='www.mydomain.com' AND timestamp >= 1234567890 AND timestamp <= 1234567890;
SELECT * FROM stats_hit WHERE url='www.mydomain.com' AND timestamp >= 1234567890 AND timestamp <= 1234567890;
This query is just the visitor hits report, the 2 are not linked at this point. This shows the number of visitors vs. number of pages viewed.
Timestamp is the number returned by Perl's time() function (number of seconds since 01/01/80). This range will vary depending if they want days/weeks/months etc....
URL will be changed in the future to a numeric ID rather than a text string.
If you analyze the queries then somewhere it should list whether or not a full table scan is being done.
Our logs run into tens of millions of records and queries would probably kill the server if we didn't have indexes.
[edited by: LifeinAsia at 3:56 pm (utc) on Aug. 21, 2007]
(Ignore the underscores, they're a formating kludge (and maybe even the wrong kludge to use))
mysql> explain select * from animals where name='sue';
+----+-------------+---------+------+---------------+------+---------+------+------+-------------+
¦ id ¦_select type ¦_table _ ¦ type ¦_possible keys ¦_key _¦ key len ¦_ref _¦ rows ¦_Extra _____ ¦
+----+-------------+---------+------+---------------+------+---------+------+------+-------------+
¦ _1 ¦_SIMPLE ____ ¦ animals ¦_ALL _¦ NULL ________ ¦ NULL ¦ __ NULL ¦_NULL ¦ __ 4 ¦_Using where ¦
+----+-------------+---------+------+---------------+------+---------+------+------+-------------+
1 row in set (0.00 sec)
Then I add an index:
mysql> alter table animals add index (name);
Query OK, 4 rows affected (0.03 sec)
Records: 4 _Duplicates: 0 _Warnings: 0
_
mysql> explain select * from animals where name='sue';
+----+-------------+---------+------+---------------+------+---------+-------+------+--------------------------+
¦ id ¦_select type ¦_table _ ¦ type ¦_possible keys ¦_key _¦ key len ¦_ref _ ¦ rows ¦_Extra __________________ ¦
+----+-------------+---------+------+---------------+------+---------+-------+------+--------------------------+
¦ _1 ¦_SIMPLE ____ ¦ animals ¦_ref _¦ name ________ ¦ name ¦ ____ 21 ¦_const ¦ __ 1 ¦_Using where; Using index ¦
+----+-------------+---------+------+---------------+------+---------+-------+------+--------------------------+
1 row in set (0.00 sec)
You'll probably want to just read the manual for explain, since my understanding of it is cursory at best.
The first explain comes up with this, as you say, searching 'ALL' and 329,000 rows.
+----+-------------+---------------+------+---------------+------+---------+------+--------+-------------+
¦ id ¦ select_type ¦ table ¦ type ¦ possible_keys ¦ key ¦ key_len ¦ ref ¦ rows ¦ Extra ¦
+----+-------------+---------------+------+---------------+------+---------+------+--------+-------------+
¦ 1 ¦ SIMPLE ¦ stats_hit_adv ¦ ALL ¦ NULL ¦ NULL ¦ NULL ¦ NULL ¦ 329417 ¦ Using where ¦
+----+-------------+---------------+------+---------------+------+---------+------+--------+-------------+
I did a few test queries, and with timestamp set to a value that returns no data, it took an average of 3 seconds per query. Now I realise a mistake with my Perl code, the data is arranged on a day by day basis, so I'm doing a query per day, multiplying this 3 seconds +data to however many days. I'll change it to do one lookup and sort it out offline.
I then added an index, as above but with url(20) instead as it complained. Definately need to change that to an int unsigned.
Anyways, the new explain looks like this.......
+----+-------------+---------------+-------+---------------+------+---------+------+-------+-------------+
¦ id ¦ select_type ¦ table ¦ type ¦ possible_keys ¦ key ¦ key_len ¦ ref ¦ rows ¦ Extra ¦
+----+-------------+---------------+-------+---------------+------+---------+------+-------+-------------+
¦ 1 ¦ SIMPLE ¦ stats_hit_adv ¦ range ¦ url ¦ url ¦ 28 ¦ NULL ¦ 20972 ¦ Using where ¦
+----+-------------+---------------+-------+---------------+------+---------+------+-------+-------------+
It appears to now be using url as a key, and only 20,000 rows, but under Extra it's not saying about using the index?
I did a few queries again to test, first was 0.5 sec, then 0.13 sec, I assume it has cached the results this time, but almost a 30x improvement!
Any ideas why it's not using the index?
Only rows that are in a given range are retrieved, using an index to select the rows.
My earlier comments about the Extra column were misleading if not outright wrong. "Using index" means that only info in the index is being read, that no actual rows must be read. So if you did
SELECT URL FROM . . . then it'd say "using index", but that wouldn't be of much use. In the example table I used earlier, there was only 1 column, hence the confusing results. [edited by: SteveLetwin at 11:03 pm (utc) on Aug. 22, 2007]