Covert Spider Activity

They can hide, but we still see them

uncle_bob

4:41 am on Jun 22, 2004 (gmt 0)

10+ Year Member



I've noticed some covert spider activity on a number of websites. Combining a week of log files from across the sites and sorting them by IP highlighted the spider nicely.

The spider requests 3 or 4 different pages per day from each site, about 6 to 8 hours apart, so the page requests often happen at a similar time each day. Each request is for just the plain webpage, with no JS, CSS or images, and the giveaway is that each request has a different browser user-agent (sometimes it pretends to be IE, sometimes Konqueror, etc.). All of the requests come from a small IP range, 66.194.6.73 - 66.194.6.81, which is owned by Websense.com.

I don't know what they are doing, but I don't like or trust spiders that try to hide their real identity. I thought initially it might be browsers checking bookmarks, but it makes its requests using GET rather than HEAD and is working (slowly) through every page on the sites.
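
A minimal sketch of the log analysis described in this post, assuming the servers write Apache-style Combined Log Format and that the merged lines from all sites are available as an iterable of strings. The function name, the asset extension list, and the threshold of three distinct user-agents are illustrative assumptions, not details from the original logs.

```python
import re
from collections import defaultdict

# Combined Log Format: ip ident user [time] "request" status size "referer" "agent"
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

# Assumed list of "supporting file" extensions a real browser would also fetch.
ASSET_EXTENSIONS = (".js", ".css", ".gif", ".jpg", ".jpeg", ".png", ".ico")


def suspicious_ips(lines, min_agents=3):
    """Return IPs that fetched pages under several user-agents but no assets."""
    agents = defaultdict(set)      # ip -> user-agents seen on page requests
    asset_hits = defaultdict(int)  # ip -> number of JS/CSS/image requests
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in the assumed format
        ip = match.group("ip")
        path = match.group("path").split("?", 1)[0].lower()
        if path.endswith(ASSET_EXTENSIONS):
            asset_hits[ip] += 1
        else:
            agents[ip].add(match.group("agent"))
    return sorted(
        ip for ip, seen in agents.items()
        if len(seen) >= min_agents and asset_hits[ip] == 0
    )
```

Sorting the combined log by IP first, as in the post, is not required here because the function groups by IP itself. A proxy shared by many real users can also show several user-agents, so the "no assets" condition is what keeps false positives down.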

volatilegx

4:12 pm on Jun 23, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I think we can agree that these activities are not consistent with search engine spidering. They actually sound more like the 'bots that spider for trademark infringements, etc. I wonder if Websense is starting something like that?

fiestagirl

9:16 pm on Jun 23, 2004 (gmt 0)

10+ Year Member



I believe that they are categorizing websites for workplace internet filtering by employers. I have the range as: 66.194.6.70-83
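
For anyone who wants to test addresses against this range or turn it into CIDR blocks for a deny list, here is a small sketch using Python's standard `ipaddress` module. The range is the one reported in this post and may be incomplete or out of date; the function names are illustrative.

```python
import ipaddress

# Range as reported in this thread; it may be incomplete or out of date.
FIRST = ipaddress.IPv4Address("66.194.6.70")
LAST = ipaddress.IPv4Address("66.194.6.83")


def in_reported_range(ip):
    """Return True if the dotted-quad string falls inside FIRST..LAST."""
    return FIRST <= ipaddress.IPv4Address(ip) <= LAST


def as_cidr_blocks():
    """Express FIRST..LAST as the smallest list of CIDR networks."""
    return list(ipaddress.summarize_address_range(FIRST, LAST))
```

For this range `as_cidr_blocks()` yields 66.194.6.70/31, 66.194.6.72/29 and 66.194.6.80/30, which is the form most firewall and server deny rules expect.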

jdMorgan

9:45 pm on Jun 23, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Several more address ranges, too, if you dig for them...

Ignores robots.txt and 403 responses, and won't go away. Best served with a very short 403 response.

Jim
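
One way to serve the "very short 403" is sketched below as a WSGI middleware, assuming the site runs on a Python WSGI stack (on Apache the same effect would normally come from server configuration instead). The block list holds only the range reported earlier in the thread, and `short_403` and `is_blocked` are hypothetical names.

```python
import ipaddress

# Hypothetical block list; the range is the one reported earlier in this thread.
BLOCKED = [
    (ipaddress.ip_address("66.194.6.70"), ipaddress.ip_address("66.194.6.83")),
]


def is_blocked(remote_addr):
    """Return True if remote_addr is an IPv4 address inside a blocked range."""
    try:
        addr = ipaddress.ip_address(remote_addr)
    except ValueError:
        return False  # missing or malformed address: do not block
    return addr.version == 4 and any(lo <= addr <= hi for lo, hi in BLOCKED)


def short_403(app):
    """WSGI middleware: answer blocked addresses with an empty-bodied 403."""
    def wrapper(environ, start_response):
        if is_blocked(environ.get("REMOTE_ADDR", "")):
            start_response(
                "403 Forbidden",
                [("Content-Type", "text/plain"), ("Content-Length", "0")],
            )
            return [b""]
        return app(environ, start_response)
    return wrapper
```

Since the spider reportedly keeps coming back regardless, the empty body just keeps the bandwidth spent on each refused request as small as possible.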

uncle_bob

11:33 pm on Jun 23, 2004 (gmt 0)

10+ Year Member



I've noticed another way of spotting these covert spiders. If you feed them a 301 redirect and they don't immediately go get the new page, then they almost certainly aren't a browser.
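
The 301 test can be checked from the logs after the fact. This is a rough sketch, assuming the log has already been parsed into `(ip, time, path, status)` tuples and that `redirects` maps each old path to its new location; the 30-second window and all names are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta


def ignores_redirects(entries, redirects, window=timedelta(seconds=30)):
    """Return IPs that received a 301 but never requested the target in time.

    entries:   iterable of (ip, time, path, status) tuples
    redirects: dict mapping old path -> new path
    """
    entries = list(entries)
    offenders = set()
    for ip, when, path, status in entries:
        if status != 301 or path not in redirects:
            continue
        target = redirects[path]
        # A browser would normally request the target right after the 301.
        followed = any(
            other_ip == ip
            and other_path == target
            and when <= other_when <= when + window
            for other_ip, other_when, other_path, _ in entries
        )
        if not followed:
            offenders.add(ip)
    return sorted(offenders)
```

The scan is quadratic in the number of entries, which is fine for a week of logs from a small site but would need an index by IP for anything larger.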

fiestagirl

1:40 pm on Jun 24, 2004 (gmt 0)

10+ Year Member



Got another IP range: 63.212.171.128-255

I used to just feed them a 403 too, but I've recently put most of these spiders on my ignore list and let them have their way. I believe that if they can't categorize a site, it is placed on the forbidden list until it can be checked by hand. There are quite a few products of this type out there, including BorderManager, NetSweeper, and Netspective. Libraries, schools and large corporations use this software, and if your site ends up on the forbidden list it will be blocked for their users.
We all have to make the choice of what is a useful robot and what isn't, of course.

bcolflesh

1:49 pm on Jun 24, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I've always thought BorderManager was traffic tunneled through Novell's firewall/VPN product:

[novell.com...]

fiestagirl

2:06 pm on Jun 24, 2004 (gmt 0)

10+ Year Member



And it supports content filtering. They are using SurfControl.
[surfcontrol.com...]