Forum Moderators: open
The spider requests 3 or 4 different pages per day from each site, about 6 to 8 hours apart, so the requests often happen at a similar time each day. Each request is for just the plain webpage, with no JS, CSS or images, and the giveaway is that each request carries a different browser user-agent (sometimes it pretends to be IE, sometimes Konqueror, etc.). All of the requests come from a small IP range, 66.194.6.73 - 66.194.6.81, which is owned by Websense.com.
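The pattern above can be spotted in access logs fairly mechanically. Here is a minimal sketch, assuming a log reduced to (IP, user-agent) pairs; the helper names are mine, and only the IP range comes from the observations above:

```python
import ipaddress
from collections import defaultdict

# The small Websense range observed above: 66.194.6.73 - 66.194.6.81.
START = ipaddress.IPv4Address("66.194.6.73")
END = ipaddress.IPv4Address("66.194.6.81")

def in_websense_range(ip: str) -> bool:
    """True if the address falls inside 66.194.6.73 - 66.194.6.81."""
    return START <= ipaddress.IPv4Address(ip) <= END

def agent_counts(entries):
    """Count distinct user-agent strings per IP; a crawler that presents
    a different browser identity on each visit stands out from a real
    user, who normally keeps one user-agent."""
    agents = defaultdict(set)
    for ip, ua in entries:
        agents[ip].add(ua)
    return {ip: len(uas) for ip, uas in agents.items()}

# Hypothetical log entries for illustration only.
log = [
    ("66.194.6.75", "Mozilla/4.0 (compatible; MSIE 6.0)"),
    ("66.194.6.75", "Mozilla/5.0 (compatible; Konqueror/3.1)"),
    ("66.194.6.75", "Mozilla/5.0 (X11; Linux i686)"),
]
print(in_websense_range("66.194.6.75"))    # True
print(agent_counts(log)["66.194.6.75"])    # 3
```

An IP in the range that shows several different user-agents over a few days is almost certainly this spider rather than a browser.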
I don't know what they are doing, but I don't like or trust spiders that try to hide their real identity. I initially thought it might be browsers checking bookmarks, but it makes requests using GET, not HEAD, and is working (slowly) through every page on the sites.
I used to just feed them a 403 as well, but I've recently put most of these spiders on my ignore list and let them have their way. I believe that if they can't categorize the site, it gets placed on the forbidden list until it can be checked by hand. There are quite a few of this type of thing out there, including BorderManager, NetSweeper, and Netspective. Libraries, schools and large corporations use this software, and your website will be blocked for their users.
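The "feed them a 403" approach amounts to a one-line decision per request. A minimal sketch, with a hypothetical helper name and the IP range noted earlier; a real deployment would do this in the web server config rather than application code:

```python
import ipaddress

# Range observed for the Websense spider: 66.194.6.73 - 66.194.6.81.
START = ipaddress.IPv4Address("66.194.6.73")
END = ipaddress.IPv4Address("66.194.6.81")

def status_for(ip: str, block: bool = True) -> int:
    """Return 403 for the filtering spider when blocking is on;
    otherwise (the 'ignore list' approach) let the request through."""
    if block and START <= ipaddress.IPv4Address(ip) <= END:
        return 403
    return 200

print(status_for("66.194.6.80"))               # 403
print(status_for("66.194.6.80", block=False))  # 200
```

The `block=False` path reflects the trade-off described above: letting the spider crawl so the site gets categorized, instead of risking the forbidden list.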
We all have to make the choice of what is a useful robot and what isn't of course.
[novell.com...]