

Banning dodgy spiders - do they take any notice?



10:14 pm on Jul 12, 2001 (gmt 0)


I am putting together a robots.txt for my site, and have seen many example files that include UAs such as EmailSiphon. My question is: what's the point? Maybe I'm missing something here, but surely these unfriendly spiders (email harvesters, programs caching copies of the site, etc.) can simply choose to ignore robots.txt? And even if they do obey it, surely they could just alter their UA? Isn't the situation that it is purely a voluntary standard, and any person writing a spider with the potential to annoy site owners will just build in the capability either to ignore the file or to mutate the UA into an acceptable form?
Cheers, Robin


10:29 pm on Jul 12, 2001 (gmt 0)


It's my experience that robots.txt is only effective for spiders that voluntarily follow the exclusion protocol, so you are correct: when it comes to rogue spiders, robots.txt is of no particular use. However, when it's used as part of the whole, i.e. robots.txt, .htaccess with mod_access or mod_rewrite, and finally some scripting measures, it becomes, imho, worthy of being included in the "system".
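
As an aside, the voluntary part of that system is just a plain robots.txt file. The entries below are only a sketch: EmailSiphon is the harvester from your example, the /cgi-bin/ path is made up, and of course a rogue spider will ignore all of it.

User-agent: EmailSiphon
Disallow: /

User-agent: *
Disallow: /cgi-bin/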

Happy hunting!



11:05 pm on Jul 12, 2001 (gmt 0)


OK thanks, I just wasn't sure if I was missing something.
What is mod_access/mod_rewrite?


7:35 pm on Jul 13, 2001 (gmt 0)


mod_rewrite and mod_access are modules that are, or can be, compiled into the Apache web server and accessed via .htaccess. They allow for testing of the User Agent, in the following form (for mod_rewrite):

RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro
RewriteRule ^.*$ x.html [L]

where EmailSiphon would be the User Agent and x.html would be the file the User Agent is redirected to.
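
A variation you'll often see, assuming you'd rather not serve a page at all, is to answer those User Agents with a 403 Forbidden via the [F] flag, with [NC] making the match case-insensitive. This is only a sketch along the same lines as the rules above:

# send a 403 Forbidden to the listed User Agents (NC = case-insensitive match)
RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [NC]
# "-" means no substitution; F returns 403 Forbidden
RewriteRule .* - [F]

If you do rewrite to a real file such as x.html, it's worth adding a RewriteCond that excludes x.html itself so the rule can't match its own target.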

Or, mod_access will simply deny the user based on User Agent (through an environment variable set by SetEnvIf, which comes from mod_setenvif) or on IP address:

SetEnvIf User-Agent EmailWolf GoAway
SetEnvIf User-Agent ExtractorPro GoAway
SetEnvIf User-Agent Wget GoAway
Order Allow,Deny
Allow from all
Deny from env=GoAway
Deny from 202.
Deny from 203.

Here any request whose User Agent matches EmailWolf gets the GoAway environment variable set, and anything with env=GoAway is denied access.
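
One thing to watch (and this assumes your build includes mod_setenvif, which is what supplies SetEnvIf): the match is a case-sensitive regular expression, so a harvester announcing itself as "emailwolf" would slip straight through. SetEnvIfNoCase does the same job case-insensitively:

# same idea, but the User Agent match ignores case
SetEnvIfNoCase User-Agent EmailWolf GoAway
SetEnvIfNoCase User-Agent ExtractorPro GoAway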

Also, as you can see at the bottom, we're denying access to two entire sets of IP blocks. This type of access control will allow you to deny access to just one full IP address, or to worm your way down the octets, as in 202.21.45., which would deny access to the 256 addresses belonging to that block.
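
To make the granularity concrete (the addresses here are made up purely for illustration):

# one specific host
Deny from 202.21.45.67
# all 256 addresses from 202.21.45.0 through 202.21.45.255
Deny from 202.21.45.
# the entire 202.x.x.x range
Deny from 202.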

If you aren't sure what's compiled into your server software you can do httpd -l from a Telnet connection. This should work even if you don't have root. If not, just ask your admin.
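
The output looks roughly like this; the exact list depends on your build, and if mod_so.c is present the other modules may be loaded dynamically through LoadModule lines in httpd.conf rather than appearing here:

$ httpd -l
Compiled-in modules:
  http_core.c
  mod_so.c
  mod_setenvif.c
  mod_access.c
  mod_rewrite.c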

If you're not running Apache but perhaps IIS, then I'm sorry for the long huff-n-puff. I know absolutely nothing about IIS. :)


