Forum Moderators: coopster
I have been experimenting with mysql and php a lot over the last year. I have built my own webstats here:
This stats report is drawn from 1 simple table that looks like this:
I know. It is horrible. Of course my queries have gotten slower and slower as the table is filling up with about 3000 new entries per day.
I have made a lot of effort trying to find out how to normalize this table, but I don't quite get it.
What would be the best way to go if I wanted to shape this table into a normalized db? I have read the article below, but i don't get it at all. Could you help out a little?
[devshed.com...]
[edited by: jatar_k at 11:35 pm (utc) on April 2, 2004]
[edit reason] no personal urls thanks [/edit]
It speeds things up remarkably. For example:
select * from table where taal="nederlands"
Create an index on taal and the query goes superfast. You can create multiple indexes.
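Assuming the log table is called stats (the table name here is just an example), the index would look something like this:

```sql
-- Add an index on the column used in the WHERE clause.
-- Table and index names are illustrative.
CREATE INDEX idx_taal ON stats (taal);

-- This query can now use the index instead of scanning every row:
SELECT * FROM stats WHERE taal = 'nederlands';
```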
(ps. Urls are not allowed)
ps 2. Fulltext indexes only help if you are using match(column) against(string); otherwise just remove them. I can't imagine a fulltext index on a timestamp doing any good.
If I understood you correctly, you're currently using only one table for every entry.
If so, you can save some space by normalizing the data.
But this means that you have to use joins, which will slow down the SQL execution time. As ikbenhet1 said, you should create some indexes on the table.
You could, for example, put an index on ip. As you have a "report" with "unique IPs" at the top, I guess you run a "select distinct(ip) ..." query. If you put an index on the IP address field, you should be able to get this number a lot faster.
If you have some more fields you select frequently, I would put an index there too.
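In SQL that would be something like this (again assuming the table is called stats; names are just examples):

```sql
-- Index the IP address field used by the unique-visitors report.
CREATE INDEX idx_ip ON stats (ip);

-- The unique-IP count can then be answered largely from the index:
SELECT COUNT(DISTINCT ip) FROM stats;
```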
If you still want to normalize your db, you could do something like this:
table_ips
table_request
table_referer
table_agent
In these tables you always have id and value as columns. For each value there is only one entry.
So if I visit 10 pages of your site there would be one entry for my IP in table_ips (eg 1, 123.456.789.0)
And in your table_log you then have something like this:
log_id, ip_id, request_id, referer_id, agent_id
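A rough sketch of that layout in SQL (column types and lengths are assumptions; adjust them to your data):

```sql
-- One lookup table per repeated value; each distinct value is stored once.
CREATE TABLE table_ips (
    id    INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    value CHAR(15)     NOT NULL,
    UNIQUE (value)
);
-- table_request, table_referer and table_agent look the same,
-- with a value length suited to each field.

-- The log table then stores only the small numeric ids:
CREATE TABLE table_log (
    log_id     INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    ip_id      INT UNSIGNED NOT NULL,
    request_id INT UNSIGNED NOT NULL,
    referer_id INT UNSIGNED NOT NULL,
    agent_id   INT UNSIGNED NOT NULL
);
```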
So with this solution you save space on the hard disk and can make parts of your reports faster.
E.g. you want to know your top requested pages: in table_log, a select with a group by on numbers is much faster than on strings. But keep in mind that you have to join with table_request in this case.
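That top-requested-pages report would then be something like this (a sketch using the example table names above):

```sql
-- Group on the small integer id; join only to display the request text.
SELECT r.value AS request, COUNT(*) AS hits
FROM table_log l
JOIN table_request r ON r.id = l.request_id
GROUP BY l.request_id, r.value
ORDER BY hits DESC
LIMIT 10;
```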
I hope this helps. If you need more help, just sticky me.
barn
Your layout is not horrible; no obvious issues. You would be advised to avoid the varchar() type unless you truly need it, as the presence of even one varchar() field in a table forces MySQL to accommodate variable-length rows (a relatively inefficient process). Use fixed-length fields and don't worry about unused space.
Database normalization relates to the use of your data more than anything else. Your desired queries will determine the best way to design your tables, indexes, and the degree to which you desire formal normalization. It is a trade-off between data management ability/efficiency and data analysis ability/efficiency.
IF you will be asking "who is my most frequent visitor," then separating out IP into a visitor table (and joining it on ID) improves normalization, efficiency, and performance (since the IP number of a frequent visitor is stored just once). Side benefits include the improved ability to run geo-based visitor analysis using an IP-to-geolocation table. Such scalability is the primary benefit of normalization.
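With IPs split into their own table, the most-frequent-visitor question becomes a simple count on the id column (table and column names here are hypothetical, following the layout suggested earlier in the thread):

```sql
-- Count log rows per visitor id; join only to show the IP itself.
SELECT v.value AS ip, COUNT(*) AS visits
FROM table_log l
JOIN table_ips v ON v.id = l.ip_id
GROUP BY l.ip_id, v.value
ORDER BY visits DESC
LIMIT 1;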
By contrast, separating out the browser field so you could tally "Mozilla" vs "IE5" etc. would not be similarly effective. That user-agent field is quite variable, and is probably best stored "as reported" in a text field. You would be advised to parse it at query time using regular expression matching. Ideally you need both -- and will build over time a data table of all known variants of the user-agent field (which you will still have to analyze with expression matching to answer queries about browsers).
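Parsing the as-reported agent string at query time might look like this (a sketch only; real user-agent strings need far more patterns than the two shown, and the stats table/agent column names are assumptions):

```sql
-- Tally rough browser families by pattern-matching the raw string.
SELECT
    CASE
        WHEN agent LIKE '%MSIE%'    THEN 'IE'
        WHEN agent LIKE '%Mozilla%' THEN 'Mozilla'
        ELSE 'other'
    END AS browser,
    COUNT(*) AS hits
FROM stats
GROUP BY browser
ORDER BY hits DESC;
```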
Ideally you need to have all of your queries defined before you design your tables... that will enable optimal design, as the feasibility of your queries determines your table layout: if you can't build the join, you will need to change your tables. This is the biggest lesson I can try to convey.
Finally, expect this to be very slow and very large no matter what. Log file management and analysis is a big job no matter how you slice it.