|Search Engine Traffic|
Calculate total spider hits
I have a question regarding traffic of spiders and I need some opinions or suggestions:
There is a free pdf-download on a website. There are 8 detailed weekly traffic logs showing hits from search engine spiders and hits of downloads. There are 21 weekly traffic logs showing only the total number of hits to the file (without sub totals of hits from spiders or downloads) and I must calculate/estimate the number of hits of actual downloads during these 21 weeks.
What is the best approach:
1) Based on the 8 weeks detailed traffic logs, calculate a weekly average of hits from spiders and use this weekly average to calculate the actual downloads during the 21 weeks (total hits - spider hits = downloads).
This approach is based on the assumption that the frequency of visits from search engine spiders is pre-scheduled by the search engine, is a roughly consistent average, and is not dependent on - or related to - other traffic/hits to the pdf-file.
2) Based on the 8 weeks detailed traffic logs, calculate a percentage (%) of the hits from spiders (in relation to total hits) and use this percentage (%) to calculate the actual downloads during the 21 weeks (total hits - % = downloads). This approach is based on the assumption that the frequency of visits from search engine spiders is related to and dependent on other traffic to the pdf-file and can be expressed as a percentage from the total number of traffic/hits to the file.
I would appreciate any suggestions or comments for finding the most appropriate and correct approach for calculating the downloads for the 21 weeks.
If you haven't been welcomed to Webmaster World yet, then in that tradition: welcome.
Your post has been lying here a few days now and nobody has provided an answer :(
I'm afraid the reason is that what you are asking for would be nearly impossible to achieve, at least with any degree of accuracy.
The assumptions and variables you suggest are far too many.
I'd say it all depends on how accurate you need this to be. 8 weeks is just a half decent sample to extrapolate from, as long as you don't need "rocket surgery", but it is only "half decent".
You can tell whether the pace of total traffic in the 8 weeks is in line with the pace in the 21 weeks, and that's the issue. If those two rates differ too much, then you will have trouble no matter which approach you take.
However, the assumption that spidering proceeds on a schedule, or a steady pace of some kind is not usually accurate - it's a very lumpy stream.
Is your goal to get human traffic numbers? Then, in your 8 week numbers, how are spiders defined? Only major search engines? A solid list of known IPs for spiders? All likely user agents, such as Moz 2?
All in all, I'd calculate both ways and see how far apart the two numbers are. And in any case, I would not base a major business decision exclusively on either number.
I would look at the weekly totals for the top browsers (probably MSIE 6.0, 5.5 and 5.0) and the top robots (Googlebot, Ask Jeeves, msnbot, Slurp). If the ratio of robots to browsers stays fairly constant then you can apply that percentage to your total hits.
Thank you wilderness, tedster and dcrombie for your comments and suggestions.
I am aware that it will not be possible to get exact numbers, but I must get the best possible numbers based on the most realistic approach.
The free download (pdf-file) has 262,012 bytes, and hits with this exact file size we consider downloads or human traffic. All other hits we consider hits from spiders. (I was told that pdf-files must be downloaded in one hit with the total file size/bytes, is this correct?)
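The size rule described above can be written down as a small sketch. It assumes the byte count of each hit has already been extracted from the logs into a list; the function name and the sample values are made up for illustration, and only the 262,012 byte figure comes from the post.

```python
FULL_SIZE = 262_012  # bytes, the PDF size stated in the post

def classify_hits(bytes_sent):
    """Split per-hit byte counts into (downloads, other hits).

    A hit counts as a download only if the full file size was sent;
    everything else is treated as a spider hit, as the post proposes.
    """
    downloads = sum(1 for b in bytes_sent if b == FULL_SIZE)
    others = len(bytes_sent) - downloads
    return downloads, others

# Hypothetical byte counts, not taken from the real logs.
sample = [262_012, 0, 14_336, 262_012, 262_012, 8_192]
print(classify_hits(sample))
```

Whether the rule itself is sound (for example, whether every human download really arrives as one full-size hit) is exactly the open question in the post; the sketch only makes the rule explicit.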
According to the 8 weeks of detailed logs, there were 456 hits in total (or 57 hits/week), including 385 hits (48 hits/week) from spiders, and 71 downloads.
The 21 weekly logs (showing only the total of weekly hits), show a dramatic increase of hits, in some weeks up to 10 times (up to 600 hits/week) compared to the 8 weeks detailed logs. Now I have to figure out the total of human traffic or number of free downloads during these 21 weeks.
I cannot see a reason why spider traffic should increase up to 10 times for a time period of 12 weeks and then return to almost normal traffic, and I believe that this increase is not related to spider traffic, but to an increase of human traffic.
I thought that establishing an average of weekly hits from spiders and using this weekly average to calculate the number of downloads during the 21 weeks would be a realistic approach. But I am not sure if this is correct, and I would like to get some other opinions.
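To see how far apart the two approaches land, here is a minimal sketch using only the 8-week totals quoted in this thread (456 hits, 385 from spiders, 71 downloads). The function names are illustrative, and clamping the first estimate at zero is an added assumption for weeks with fewer total hits than the spider average.

```python
DETAILED_WEEKS = 8
TOTAL_HITS = 456    # all hits in the 8 detailed weeks
SPIDER_HITS = 385   # hits classified as spiders
DOWNLOADS = 71      # hits classified as downloads

def estimate_fixed_average(weekly_total):
    """Approach 1: subtract a constant weekly spider average."""
    avg_spider = SPIDER_HITS / DETAILED_WEEKS
    # Assumption: never report a negative number of downloads.
    return max(weekly_total - avg_spider, 0.0)

def estimate_fixed_share(weekly_total):
    """Approach 2: assume downloads are a constant share of all hits."""
    download_share = DOWNLOADS / TOTAL_HITS
    return weekly_total * download_share

for total in (57, 600):
    print(total,
          round(estimate_fixed_average(total), 1),
          round(estimate_fixed_share(total), 1))
```

At the 8-week average of 57 hits/week both approaches give about 9 downloads, but for a 600-hit week the first gives roughly 552 downloads and the second roughly 93. The choice of assumption therefore dominates the result for the high-traffic weeks.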
Regarding the ratio to browsers as suggested by dcrombie: my logs do not identify browser types.
|The free download (pdf-file) has 262,012 bytes, and hits with this exact file size we consider downloads or human traffic. All other hits we consider hits from spiders. (I was told that pdf-files must be downloaded in one hit with the total file size/bytes, is this correct?) |
Is PDF-only logging a very limited option that reflects the host's preference for logs rather than the customer's, as compared to full logs being available?
I'd suggest another host.
For about a year when I first began my sites, they were hosted by a provider who offered excellent service. The logs, however, were generated in an incomplete form by a script, and it took me about two days at the end of each month to compile reasonable stats.
In the end the host had other weaknesses and the log theme should have provided that insight to me :(
I agree with wilderness.
It's the old adage: if you can't measure it, you can't manage it.
It used to be that PDFs caused all sorts of problems with stats, as each page counted as a download because the file was served up one page at a time (maybe someone active in PDF tracking can confirm if that is still the case).
The entire problem is a year old and has nothing to do with our current host. Bad record keeping or not, the 21 weekly logs (with only the total number of weekly hits) and the 8 detailed weekly logs are all we have.
I agree with both of you, but unfortunately it does not resolve my problem. Somehow I must come up with an answer and I am looking for the most realistic approach, which certainly will not and is not expected to be perfect.
I must establish some kind of trend (such as an average of weekly spider hits from the 8 detailed weekly logs, or a percentage) and deduct it from the total number of hits. The question is: what is the most realistic approach?
|Somehow I must come up with an answer and I am looking for the most realistic approach, which certainly will not and is not expected to be perfect. |
With the above in mind, and with it understood that this is NOT a realistic approach to stats:
You have many references to the term "hits."
"Hits" is a beginner's stat: the hits on a page count the page itself plus every image on that page.
If the page includes four images, then that page will return five hits.
Or have you just misused the term "hits" and are referring to actual visitors?
If you have used the term "hits" correctly, as most web log software uses it, then have you determined the quantity of hits for each page of the website?
Do the PDF logs contain lines for images?
The file size of the PDF is insignificant, as it depends entirely on the many variable methods by which a PDF may be created.
The bots which crawl websites vary so much from website to website that it's impossible to measure any frequency. I have major bots (Google, msn and all the Yahoo bots) crawling my sites constantly and daily. There is no beginning and no end. Even Jeeves, which may use the most consistent method, grabs the majority of pages in one crawl and then returns throughout the month with partial crawls. The major bots have NO predetermined date (such as the 3rd or 10th) of each month to begin their new crawls.
What of the unknown and malicious bots? They have no frequency, nor have you mentioned any method of preventing or identifying these bots.
In the end, I believe the most accurate method for you may be to take periods before and after the one you are interested in, for which you have "accurate and full logs", and make an adjustment for any variations in the differences between those two periods.
Sorry for using the term "hits". There is no webpage, there are no images, and the access/hit from a visitor automatically initiates the download.
I thought that during the time period of two months (or 8 weeks) the total number of engine hits would be roughly the same (with plus/minus 20%).
Thanks for your very helpful comments and suggestions.