
Sitemaps, Meta Data, and robots.txt Forum

    
robots.txt
Banning dodgy spiders - do they take any notice?
themoff

10+ Year Member



 
Msg#: 107 posted 10:14 pm on Jul 12, 2001 (gmt 0)

I am putting together a robots.txt for my site, and have seen many example files that include UAs such as EmailSiphon. My question is: what's the point? Maybe I'm missing something here, but surely spiders that are not friendly (email harvesters, bots caching copies of the site, etc.) can simply choose to ignore robots.txt? And even if they obey it, surely they could just alter their UA? Isn't the situation that it is purely a voluntary standard, and anyone writing a spider with the potential to annoy site owners will just build in the ability either to ignore the file or to change its UA to an acceptable form?
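For example, the sort of entry I mean (a minimal sketch, with EmailSiphon as the example UA):

User-agent: EmailSiphon
Disallow: /

User-agent: *
Disallow:

i.e. telling that one agent to stay out entirely while leaving everyone else alone.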
Cheers, Robin

 

awoyo

10+ Year Member



 
Msg#: 107 posted 10:29 pm on Jul 12, 2001 (gmt 0)

It's my experience that robots.txt is only effective for spiders that voluntarily follow the exclusion protocol, so you are correct: against rogue spiders, robots.txt is of no particular use on its own. However, when it's used as part of the whole, i.e. robots.txt, .htaccess use of mod_access or mod_rewrite, and finally some scripting measures, it becomes, imho, worth including in the "system".

Happy hunting!

Jim

themoff

10+ Year Member



 
Msg#: 107 posted 11:05 pm on Jul 12, 2001 (gmt 0)

OK, thanks. I just wasn't sure if I was missing something.
What is mod_access/mod_rewrite?

awoyo

10+ Year Member



 
Msg#: 107 posted 7:35 pm on Jul 13, 2001 (gmt 0)

mod_rewrite and mod_access are modules that are, or can be, compiled into the Apache web server and used from .htaccess. They allow you to test the User-Agent, in this form (for mod_rewrite):

RewriteEngine on
# Match any request whose User-Agent begins with one of these names
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro
# Send every matching request to x.html and stop processing further rules
RewriteRule ^.*$ x.html [L]

where EmailSiphon would be the User Agent and x.html would be the file the User Agent is redirected to.
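If you'd rather not serve a page at all, the same conditions can instead end with mod_rewrite's F flag, which returns a 403 Forbidden, something like:

RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro
# "-" means no substitution; [F] answers with 403 Forbidden
RewriteRule ^.*$ - [F]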

Alternatively, mod_access (with mod_setenvif providing the SetEnvIf directive) will simply deny the user based on User-Agent or IP address:

# Flag requests from these User-Agents with the GoAway environment variable
SetEnvIf User-Agent EmailWolf GoAway
SetEnvIf User-Agent ExtractorPro GoAway
SetEnvIf User-Agent Wget GoAway
# Allow everyone except flagged agents and the listed IP blocks
Order Allow,Deny
Allow from all
Deny from env=GoAway
Deny from 202.
Deny from 203.

Here any request whose User-Agent matches EmailWolf gets the GoAway environment variable set, and anything with env=GoAway is denied access.

Also, as you can see at the bottom, we're denying access to two entire sets of IP blocks. This type of access control will allow you to deny access to just one IP address, as in 202.21.45.169, or worm your way down the octets, as in 202.21.45., which would deny access to all 256 addresses belonging to that block.
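If you need something between a single address and a whole trailing-octet block, mod_access also accepts network/netmask and CIDR-style forms, along the lines of:

Deny from 202.21.45.0/255.255.255.0
Deny from 203.0.0.0/8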

If you aren't sure what's compiled into your server software you can do httpd -l from a Telnet connection. This should work even if you don't have root. If not, just ask your admin.
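The output is just a list of the compiled-in module source files, roughly like this (yours will differ):

Compiled-in modules:
  http_core.c
  mod_env.c
  mod_log_config.c
  mod_setenvif.c
  mod_access.c
  mod_rewrite.c

If mod_rewrite.c or mod_setenvif.c isn't in that list (and isn't loaded as a DSO), the corresponding directives above won't work.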

If you're not running Apache but perhaps IIS, then I'm sorry for the long huff-n-puff. I know absolutely nothing about IIS. :)

Jim
