Twisted PageGetter

Pfui

10:53 am on Sep 16, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



91.194.158.nnn
Twisted PageGetter

robots.txt? NO

Old bots never die. Operators like this one, from the UK's SurfControl, just keep unleashing them.

[webmasterworld.com...]

jdMorgan

3:58 pm on Sep 17, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



It looks like this replaces the previous user-agents from Websense:

ScSpider/<version, etc.>

and the exact match

SurfControl

I allow these requests in order to retain whatever corporate-employee traffic I can, on the assumption that my content will pass at least some of those employees' corporate filters. But I leave that up to the filters; I'm not going to cloak in order to retain that traffic.

These requests are accompanied by (only) two other headers that you can use to validate them in addition to the IP range:

Via: 1.0 webdefence.global.blackspider.com:8081 WebDefence 3.1.5 (13102) 01p
X-Forwarded-For: 192.168.43.2

None of the usual Connection, Accept, Accept-Encoding, or Accept-Language request headers that you'd see from a browser are present, and no common proxy headers other than the two above appear to be sent.

I assume that the "Via" HTTP protocol identifier ("1.0"), the domain, the port number, and the "WebDefence" name are fixed, but that the version and release sequence numbers may change.
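
If you want to automate that validation, here's a rough sketch in Python (my own construction, not anything Websense publishes) that pins the parts I assume are fixed and wildcards the version and build numbers. The function name, the exact regex, and the 192.168.0.0/16 check are all assumptions based only on the sample headers above:

import ipaddress
import re

# Fixed parts of the sample Via header above; the version ("3.1.5"),
# build number ("13102"), and trailing sequence ("01p") are wildcarded,
# since I assume those change between releases.
VIA_PATTERN = re.compile(
    r"^1\.0 webdefence\.global\.blackspider\.com:8081 "
    r"WebDefence \d+(?:\.\d+)* \(\d+\) \d+p$")

def looks_like_webdefence(via, x_forwarded_for):
    # Match the Via header against the parts assumed to be fixed.
    if not via or not VIA_PATTERN.match(via):
        return False
    # Require the reportedly always-internal 192.168.x.x address too.
    try:
        addr = ipaddress.ip_address((x_forwarded_for or "").strip())
    except ValueError:
        return False
    return addr in ipaddress.ip_network("192.168.0.0/16")

# The sample headers quoted above pass:
print(looks_like_webdefence(
    "1.0 webdefence.global.blackspider.com:8081 WebDefence 3.1.5 (13102) 01p",
    "192.168.43.2"))  # True

Pair this with a check of the requesting IP against the filter's published address range; I haven't hard-coded one here, since I don't have an authoritative list.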

The X-Forwarded-For address is always an internal IP address in the 192.168.x.x range, so there is no clear-cut way to tell whether a given hit is a proxied user request or an asynchronous "check" of a direct user request that arrived from a different IP address. I can't deduce this from my (jumbled) raw access logs, but it might be more obvious on sites with less traffic.

To be clear on that, the two common ways for these Web content filters to work are:
1) The user's requests go *through* the Web filter, which acts as a proxy.
2) The user's requests are sent to the filter separately, so that it can "check up on them" after the fact. (In this case you see the actual user request first, with the filter's request usually arriving at about the same time or soon after, though sometimes the "checkup" happens much later; see the sketch after this list.)
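
To illustrate case 2, here's a rough sketch (again Python, and again entirely my own) of the kind of correlation you might attempt on a low-traffic site. It assumes each access-log entry has already been parsed down to a (timestamp, ip, path, user_agent) tuple; the ten-minute window is an arbitrary stand-in for "soon after":

from datetime import timedelta

FILTER_UA = "Twisted PageGetter"
WINDOW = timedelta(minutes=10)  # arbitrary "soon after" window

def classify_filter_hits(entries):
    # For each filter request, report whether a non-filter request for
    # the same path arrived shortly before it (suggesting case 2, an
    # after-the-fact checkup) or not (suggesting case 1, a proxied
    # fetch made on the user's behalf).
    entries = sorted(entries, key=lambda e: e[0])
    results = []
    for i, (ts, _ip, path, ua) in enumerate(entries):
        if ua != FILTER_UA:
            continue
        preceded = any(
            e[3] != FILTER_UA and e[2] == path and ts - e[0] <= WINDOW
            for e in entries[:i])
        results.append((ts, path, "checkup?" if preceded else "proxied?"))
    return results

Since a "checkup" can also come much later, and a proxied request has no earlier twin to find, this only suggests a classification; it can't prove one.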

In either case, the "filter" may sit at the user's location (e.g. a firewall/filter appliance in the corporate IT department), or it may be hosted remotely as a service by a Web-filtering company "in the cloud."

[added] With all of the abuse rampant on the web, the use of the dangerous-sounding terms "black spider" and "twisted page getter" in the request headers seems surprisingly ill-advised, and a bit naive and "unprofessional." I don't know why they can't just be up-front and use emotionally-neutral terms like "Web content filter."[/added]

Jim