Average traffic percentage of spider visits

WebRookie

11:07 pm on Nov 28, 2000 (gmt 0)

What percentage of overall monthly traffic do you attribute to spider visits? Any comparisons on what a good average percentage is in spider traffic to aim for?

chiyo

3:01 am on Nov 29, 2000 (gmt 0)

Great question! I've always wondered how many of the "visits" that Web sites claim are simply admin hits: designers' hits, spider visits, page-checking utilities, email extractors, etc.

We worked out that around 30-35% of page hits and around 10% of unique visitors can be attributed to search engine spiders across our 4 sites.
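For anyone wanting to work out the same two figures, here is a minimal sketch of the arithmetic. It assumes each page hit has already been reduced to a (visitor address, is-spider flag) pair; deciding how to set that flag is the hard part and is not shown. The function name and the sample data are made up for illustration.

```python
def spider_shares(entries):
    """Return (share of hits, share of unique visitors) attributed to spiders.

    entries: iterable of (address, is_spider) tuples, one per page hit.
    Both shares are fractions between 0 and 1.
    """
    entries = list(entries)
    if not entries:
        return 0.0, 0.0
    spider_hits = sum(1 for _, is_spider in entries if is_spider)
    all_visitors = {addr for addr, _ in entries}
    spider_visitors = {addr for addr, is_spider in entries if is_spider}
    return spider_hits / len(entries), len(spider_visitors) / len(all_visitors)


# Hypothetical sample: one spider making several hits, four people making few.
sample = (
    [("10.0.0.1", True)] * 3
    + [("10.0.0.2", False)] * 2
    + [("10.0.0.3", False)] * 2
    + [("10.0.0.4", False)] * 2
    + [("10.0.0.5", False)]
)
hit_share, visitor_share = spider_shares(sample)
```

Because one spider usually requests many pages, its share of hits tends to run well above its share of unique visitors.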

WebRookie

4:37 pm on Nov 29, 2000 (gmt 0)

Hi chiyo, thanks for the response. We're averaging 12% per month at this time for unique visitors. Had nothing to compare to. This also started me thinking about how much traffic is coming in through directories only.

Wouldn't it be great to sift out the "extras" you mentioned to get truer statistics, and to find a good logs program or script that could be tweaked to do this?

I'm hoping our new logs program will help with directory totals and we can get more specifics overall on spider visits to our sites.

littleman

7:47 pm on Nov 29, 2000 (gmt 0)

I've had it as high as a 50-50 split.

WebRookie

8:12 pm on Nov 29, 2000 (gmt 0)

Wow, great. Is that for a number of sites or one site?

littleman

8:16 pm on Nov 29, 2000 (gmt 0)

It is over 100 domains.

eljefe3

4:24 am on Nov 30, 2000 (gmt 0)

>>I've had it as high as a 50-50 split.

Littleman, nice spider trap there.

skirril

12:36 am on Dec 19, 2000 (gmt 0)

It depends a bit on the month, and on the day actually.

I have had as many as 60% of visits be candidates for spiders, though it's hard to say.

You can't really say that if, e.g., the IP doesn't resolve, it's a spider; many dial-ups don't resolve either.

Otoh more and more spiders 'cloak themselves'.

I'd say realistically, 30-40% are spiders.

Skirril

WebRookie

1:14 am on Dec 19, 2000 (gmt 0)

Hi skirril. Welcome to the forums.

Are you looking through raw log files? In our case we use a logs program. Certainly some of the visits are listed in the logs report under spiders but as you say, how accurate is this? We've also had speculation that more are coming in but not recognized by the logs program.

skirril

6:14 pm on Dec 19, 2000 (gmt 0)

Atm I use analog to analyse the logfiles; however, I also look at the raw files (apache: access_log, agent_log, etc).

IMHO, the following identifies a (well-behaved) spider:

a) gets robots.txt

b) crawls all pages of your site that are linked together, except for those excluded in the robots.txt file or the ROBOTS meta tag. Time to crawl varies between spiders (I have seen anything from more or less all pages fetched at once to a wait of up to 30 mins between crawls).

c) if it's well-written it will not get stylesheets, graphic files, and binary files. It will also never do a POST of form data.

d) if it's well-behaved, you'll get a robot name and a URL in the user agent.

e) If the address it comes from can't be resolved to a DNS name, it may be a sign of a spider, but need not be. Often addresses located in far-east ('developing') countries do not resolve. Also, .com, .net, etc need not be in the US. Many ISPs also have (dial-up) addresses that do not have a name associated.

f) spiders might also use the HEAD instead of the GET command, mostly to see whether a page was modified since the last crawl, or to see what HTTP software is used (netcraft.com).

To conclude, I cannot say how many of those rules need to be true to identify a spider, and there may be spiders that fail all those rules.
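As a rough illustration of the checklist above, the following sketch collects which of the signs a single visitor shows. It assumes the log has already been grouped into a list of (method, path) requests per visitor; the function name, the signal names, and the list of file suffixes are made up for the example. Rule b (crawl coverage) and rule e (DNS resolution) are left out because they need more than one visitor's request list to judge.

```python
import re

# Assumed list of "static" file types; extend it to match your own site.
STATIC_SUFFIXES = (".css", ".gif", ".jpg", ".jpeg", ".png", ".exe", ".zip")


def spider_signals(requests, user_agent):
    """Collect which well-behaved-spider signs one visitor shows.

    requests: list of (method, path) tuples for one visitor.
    user_agent: the User-Agent string that visitor sent.
    Returns a set of signal names; no threshold is applied.
    """
    signals = set()
    methods = [m.upper() for m, _ in requests]
    paths = [p.lower().split("?")[0] for _, p in requests]

    if "/robots.txt" in paths:
        signals.add("fetched_robots_txt")      # rule a
    if requests and not any(p.endswith(STATIC_SUFFIXES) for p in paths):
        signals.add("no_static_files")         # rule c
    if requests and "POST" not in methods:
        signals.add("no_post")                 # rule c
    if re.search(r"https?://", user_agent):
        signals.add("url_in_user_agent")       # rule d
    if "HEAD" in methods:
        signals.add("used_head")               # rule f
    return signals


# Hypothetical visitors for illustration.
crawler = spider_signals(
    [("GET", "/robots.txt"), ("GET", "/index.html"), ("HEAD", "/about.html")],
    "ExampleBot/1.0 (http://example.com/bot.html)",
)
person = spider_signals(
    [("GET", "/index.html"), ("GET", "/style.css"), ("POST", "/contact.cgi")],
    "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)",
)
```

The function deliberately returns the matched signs rather than a yes/no answer, since there is no agreed number of rules that makes a visitor a spider.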

What's also said on the analog website (www.analog.cx, www.analog.cx/docs/meaning.html) is that it is impossible to determine how many ppl/spiders visit your website.

Skirril

WebRookie

6:52 pm on Dec 19, 2000 (gmt 0)

Really interesting post. Thanks for the analog page, it's helpful. Makes more sense now how difficult it is to find out the true numbers of spider visits.

>more and more spiders 'cloak themselves'.

Would you talk about this a little more? Curious about the purpose of cloaking spiders.

msgraph

7:24 pm on Dec 19, 2000 (gmt 0)

Some of the major search engines send out robots set up as common users from time to time.

They will use user agents like:

Mozilla/4.0 (#could be anything#)

There are many reasons that they do this:

1. To check for sites that redirect any non-spiders from viewing doorway pages.

2. To check the integrity of a site, for example to make sure there are no major delays on the server/site.

But just because you see some type of Mozilla user visiting your site off an Altavista or Inktomi server, do not jump to conclusions right away that this is a new spider. The people in their companies surf the web too, just like you or me. This is when it's useful to have a few more domains under your belt or to post some questions on this forum to verify things.
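To pick such visits out of a log for a closer look, a check along these lines might do. It is only a sketch: the hostname suffixes and the sample hostnames are assumptions for illustration, not a verified list of crawler hosts, and the function name is made up.

```python
# Assumed hostname suffixes, for illustration only; check your own logs.
ENGINE_SUFFIXES = (".altavista.com", ".inktomi.com")


def possible_cloaked_spider(hostname, user_agent):
    """Flag a browser-like user agent arriving from a search engine's network.

    A True result is only a hint worth a closer look; it could just as well
    be an employee browsing from work.
    """
    from_engine = hostname.lower().endswith(ENGINE_SUFFIXES)
    looks_like_browser = user_agent.startswith("Mozilla/")
    return from_engine and looks_like_browser


# Hypothetical log entries.
flagged = possible_cloaked_spider("crawl3.inktomi.com", "Mozilla/4.0 (compatible)")
dialup = possible_cloaked_spider("dialup-12.example.net", "Mozilla/4.0 (compatible)")
declared = possible_cloaked_spider("crawl3.inktomi.com", "ExampleBot/1.0")
```

Comparing the flagged visits across several domains, as suggested above, is what turns a hint into something more reliable.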

WebRookie

7:43 pm on Dec 19, 2000 (gmt 0)

Right, I'm not using doorways so this makes sense. I'm missing info on doorways and cloaking, good to know.

I've seen many user agents in our logs report as you've listed. And we do have a few re-directs, mostly from defunct centers that re-direct to our home page.

skirril

7:45 pm on Dec 19, 2000 (gmt 0)

In my opinion the 'cloaking of spiders' can have the following reasons (non-exhaustive list):

a) Browser optimisation/ dynamic pages
----------------------------------------
I think it is due to the fact that there are many sites that use a little script in the beginning of the page to 'optimise' the display of the page for the current browser, or dynamically generate the page, taking into account the current browser.

More often than not, those scripts are so simple that they only check for the two most important browsers, giving the rest of the users a message of the kind: 'your browser is unable to display this page (it doesn't support frames/tables/whatever), so please update to the new FooScape Explorer 9.99'.

As we all know, this can be bad web design, and in the medium to long run you might pay for such insolence. I don't even take into account here that on most browsers java, which is usually used for such things, is slow, and might turn the potential customer away before the page is even loaded.
Server-side scripting otoh usually places heavy hw requirements on the server, and might mean the 'death' of your site once the traffic increases.

A spider indexing that page will then of course also get this message (and hence have nothing useful to index). So some 'clever' spiders cloak themselves as one of the common browsers, so that they get the useful page and not the 'your browser is old, we don't need your business' page.
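To make the effect concrete, here is a toy version of such a sniffing script, written server-side for brevity. Everything in it is an assumption for illustration: the function name, the notice text, and the use of a "Mozilla/" prefix as a stand-in for "one of the two big browsers".

```python
UPGRADE_NOTICE = "Your browser is unable to display this page. Please upgrade."


def naive_page_for(user_agent, full_page):
    """Mimic a sniffing script that only recognises the common browsers."""
    if user_agent.startswith("Mozilla/"):
        return full_page
    # Everything else, including honest spiders, gets the notice.
    return UPGRADE_NOTICE


page = "<html>real content</html>"
seen_by_honest_spider = naive_page_for("ExampleBot/1.0", page)
seen_by_cloaked_spider = naive_page_for("Mozilla/4.0 (compatible)", page)
```

The honest spider ends up indexing the upgrade notice, while the cloaked one gets the real content, which is the incentive described above.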

This of course makes it extremely hard to distinguish between 'real people' and spiders, esp. if they come through a cache (e.g. aol) and were not referred by some other site. To my knowledge there's no way you could distinguish between a 'real user' and a spider, esp. if the aforementioned conditions hold. As described in the reference I gave in my prior post, there's also no way to know 'how many' users are behind a proxy.

b) theft of information, 'economic information warfare'
-----------------------------------------------

A completely different thing which is also imaginable (and has surely been done) is to develop a spider that crawls the net to look for competitors and steal information. If I were to do that, I'd most certainly cloak my spider, esp. if it has a fixed DNS name (such as xxx.thecompetition.com).

To me there's only one way to guard against that, and I think I am stating the obvious here:

The net is a public medium. It is extremely hard (nigh impossible) to control who gets the information published on a website. Hence, publish only things that you intend for public release.

A slightly larger perspective
-----------------------------

'Hacking' has turned from the sport of youngsters it was in the '60s, '70s and '80s ('Look, I am good, I hacked the CIA') into full-fledged information warfare (denial-of-service attacks, theft of information, etc). What adds to this is that a fair bit of electronic mail travels the net unencrypted, hence readable to anyone tapping the network.

The only 'immunisation' against this is to store the sensitive information on a well-secured system.
Securing systems is ofc. non-trivial, and the level of security grows with the amount of resources invested. It also tends to come at the cost of the 'usability' of a system.

Skirril

WebRookie

8:08 pm on Dec 19, 2000 (gmt 0)

Great info, skirril, fascinating stuff. I feel like I know a little more today about spiders and bots, thanks.