We worked out that, across 4 sites, around 30-35% of page hits and around 10% of unique visitors can be attributed to search engine spiders.
Wouldn't it be great to sift out the "extras", as you mentioned, to get truer statistics, and to find a good logs program or script that could be tweaked to do this?
I'm hoping our new logs program will help with directory totals and we can get more specifics overall on spider visits to our sites.
I have seen as much as 60% being candidates for spiders, though it's hard to say.
You can't really say that if, e.g., the IP doesn't resolve, it's a spider; many dial-ups don't resolve either.
OTOH, more and more spiders 'cloak themselves'.
I'd say, realistically, 30-40% are spiders.
Skirril
Are you looking through raw log files? In our case we use a logs program. Certainly some of the visits are listed in the logs report under spiders but as you say, how accurate is this? We've also had speculation that more are coming in but not recognized by the logs program.
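For anyone who does go through the raw logs, here is a rough sketch in Python of one way to estimate those percentages. It assumes the log is in Apache Combined Log Format, and it treats any client that fetched /robots.txt as a spider, which is only one heuristic and will miss cloaked spiders. The function name spider_share is made up for this example.

```python
import re
from collections import Counter

# Assumption: Apache Common/Combined Log Format, one request per line.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+'
)

def spider_share(lines):
    """Return (share of hits, share of distinct clients) that look like spiders.

    A client counts as a spider here if it requested /robots.txt at least once.
    Distinct IP addresses stand in for "unique visitors", which is itself rough.
    """
    hits = Counter()
    robots_clients = set()
    for line in lines:
        m = LOG_LINE.match(line)
        if not m:
            continue  # skip lines that do not parse
        ip = m.group("ip")
        hits[ip] += 1
        if m.group("path").split("?")[0] == "/robots.txt":
            robots_clients.add(ip)
    total_hits = sum(hits.values())
    if total_hits == 0:
        return 0.0, 0.0
    spider_hits = sum(hits[ip] for ip in robots_clients)
    return spider_hits / total_hits, len(robots_clients) / len(hits)
```

Usage would be something like spider_share(open("access.log")), where access.log is whatever your server writes. The numbers are a lower bound at best, since a spider that never asks for robots.txt is counted as a person.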
IMHO, the following identifies a (well-behaved) spider:
a) gets robots.txt
b) crawls all pages of your site that are linked together, except for those excluded by the robots.txt file or the ROBOTS meta tag. Crawl timing varies between spiders (I have seen anything from more or less all pages in one go to a wait of up to 30 minutes between fetches).
c) if it's well-written, it will not get stylesheets, graphic files, or binary files. It will also never do a POST of form data.
d) if it's well-behaved, you'll get a robot name and a URL in the user agent.
e) If the address it comes from can't be resolved to a DNS name, it may be a sign of a spider, but need not be. Often addresses located in Far East ('developing') countries do not resolve. Also, .com, .net, etc. need not be in the US. Many ISPs also have (dial-up) addresses that do not have a name associated.
f) spiders might also use the HEAD instead of the GET command, mostly to see whether a page was modified since the last crawl, or to see what HTTP software is used (netcraft.com).
To conclude, I cannot say how many of those rules need to be true to identify a spider, and there may be spiders that fail all those rules.
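Since no single rule settles it, one way to use rules a) to f) together is to add them up into a score and look at the visitors with the highest totals. The following is only a sketch in Python: the field names and the weights are my own assumptions, not measured values, and a cloaked spider can still score as low as a person.

```python
from dataclasses import dataclass

@dataclass
class VisitorSummary:
    """What one client did during a visit (all field names are hypothetical)."""
    fetched_robots_txt: bool   # rule a
    crawled_most_pages: bool   # rule b
    fetched_assets: bool       # rule c: stylesheets, graphics, binaries
    sent_post: bool            # rule c: form POSTs
    agent_has_url: bool        # rule d: robot name and URL in the user agent
    resolves_in_dns: bool      # rule e
    used_head: bool            # rule f

def spider_score(v: VisitorSummary) -> int:
    """Higher means more spider-like. Weights are arbitrary assumptions."""
    score = 0
    if v.fetched_robots_txt:
        score += 3
    if v.crawled_most_pages:
        score += 2
    if not v.fetched_assets:
        score += 1
    if v.sent_post:
        score -= 2   # a well-written spider never POSTs form data
    if v.agent_has_url:
        score += 3
    if not v.resolves_in_dns:
        score += 1   # weak evidence: many dial-ups do not resolve either
    if v.used_head:
        score += 1
    return score
```

Where to put the cut-off between "spider" and "person" is a judgment call, and it is worth checking a few visitors by hand before trusting any threshold.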
What's also said on the Analog website (www.analog.cx, www.analog.cx/docs/meaning.html) is that it is impossible to determine how many people/spiders visit your website.
Skirril
They will use user agents like:
Mozilla/4.0 (#could be anything#)
There are many reasons that they do this:
1. To check for sites that redirect any non-spiders from viewing doorway pages.
2. To check the integrity of a site, e.g. to make sure there are no major delays on the server/site.
But just because you see some type of Mozilla user visiting your site off an Altavista or Inktomi server, do not jump to conclusions right away that this is a new spider. The people in their companies surf the web too, just like you or me. This is when it's useful to have a few more domains under your belt or to post some questions on this forum to verify things.
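To pick those visits out of a log in the first place, something like the following Python sketch could flag Mozilla-style user agents whose address resolves into a search engine's domain. The function names are made up for this example, and the domain list is a placeholder you would fill with the hostnames you actually see. A match only says the visit came from the engine's network, not whether it was a spider or an employee.

```python
import socket

def reverse_lookup(ip):
    """Return the DNS name for an IP address, or None if it does not resolve."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:  # covers socket.herror and socket.gaierror
        return None

def host_in_domains(hostname, domains):
    """True if hostname equals, or is a subdomain of, one of the given domains."""
    if not hostname:
        return False
    host = hostname.lower().rstrip(".")
    return any(host == d or host.endswith("." + d) for d in domains)

def looks_like_engine_visit(user_agent, hostname, engine_domains):
    """Flag browser-like user agents coming from a search engine's hosts."""
    return user_agent.startswith("Mozilla/") and host_in_domains(hostname, engine_domains)

# Placeholder list: replace with the domains you see in your own logs.
ENGINE_DOMAINS = ["engine.example.com"]
```

In use you would call reverse_lookup on the logged IP and pass the result in as hostname. Comparing on a leading dot keeps a name like notengine.example.com from matching engine.example.com by accident.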
a) Browser optimisation/ dynamic pages
----------------------------------------
I think this is due to the fact that there are many sites that use a little script at the beginning of the page to 'optimise' the display of the page for the current browser, or dynamically generate the page, taking into account the current browser.
More often than not, those scripts are so simple that they only check for the two most important browsers, giving the rest of the users a message of the kind: 'your browser is unable to display this page (it doesn't support frames/tables/whatever), so please update to the new FooScape Explorer 9.99'.
As we all know, this can be bad web design, and in the middle to long run, you might pay for such insolence. I am not even taking into account here that on most browsers Java, which is usually used for such things, is slow, and might turn the potential customer away before the page has even loaded.
Server-side scripting, OTOH, usually places heavy hardware requirements on the server, and might mean the 'death' of your site once the traffic increases.
A spider indexing that page will then of course also get this message (and hence have nothing useful to index). So some 'clever' spiders cloak themselves as one of the common browsers, so that they get the useful page, and not the 'your browser is old, we don't need your business' page.
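To make the mechanism concrete, here is a small Python sketch of the kind of check I mean, next to a more tolerant variant. The page names and the exact string tests are assumptions made up for illustration; real scripts of this sort usually run in the browser and differ in detail.

```python
def naive_page_for(user_agent):
    """Sniffing that only knows two browsers and turns everyone else away."""
    if "MSIE" in user_agent:
        return "ie_page"
    if user_agent.startswith("Mozilla/"):
        return "netscape_page"
    return "upgrade_notice"   # this is all an honest spider gets to index

def tolerant_page_for(user_agent):
    """Same optimisation, but unknown agents still get the real content."""
    if "MSIE" in user_agent:
        return "ie_page"
    if user_agent.startswith("Mozilla/"):
        return "netscape_page"
    return "plain_page"       # unoptimised but complete

honest_spider = "ExampleBot/1.0 (+http://bot.example/)"
cloaked_spider = "Mozilla/4.0 (#could be anything#)"
```

With the naive version the honest spider is sent to the upgrade notice, while the cloaked one gets a real page. With the tolerant version the spider has no reason to cloak in the first place.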
This of course makes it extremely hard to distinguish between 'real people' and spiders, esp. if they come through a cache (e.g. AOL) and were not referred by some other site. To my knowledge there's no way you could distinguish between a 'real user' and a spider, esp. if the aforementioned conditions hold. As described in the reference I gave in my prior post, there's also no way to know 'how many' users are behind a proxy.
b) theft of information, 'economic information warfare'
-----------------------------------------------
A completely different thing which is also imaginable (and has surely been done) is to develop a spider that crawls the net to search for competitors and steal information. If I were to do that, I'd most certainly cloak my spider, esp. if it had a fixed DNS name (such as xxx.thecompetition.com).
To me there's only one way to guard against that, and I think I am stating the obvious here:
The net is a public medium. It is extremely hard (nigh impossible) to control who gets the information published on a website. Hence, publish only things that you intend for public release.
A slightly larger perspective
-----------------------------
'Hacking' has turned from the sport of youngsters that it was in the '60s, '70s and '80s ('Look I am good, I hacked the CIA') into full-fledged information warfare (denial of service attacks, theft of information, etc). What adds to this is that a fair bit of electronic mail travels the net unencrypted, hence readable by anyone tapping the network.
The only 'immunisation' to this is to store the sensitive information on a well-secured system.
Securing systems is of course non-trivial, and the level of security is directly proportional to the amount of resources invested. It is also inversely proportional to the 'usability' of a system.
Skirril