Forum Moderators: goodroi

allow some and exclude the rest in robots.txt

         

jcmiras

7:11 pm on Aug 17, 2005 (gmt 0)

10+ Year Member



Is there a way to allow some bots (e.g. googlebot, msnbot, and slurp) to spider my website and exclude all the rest? Please tell me how.

Dijkgraaf

11:15 pm on Aug 17, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Yes, something like the following should work. An empty Disallow line means "allow everything" for that bot, and the final wildcard record blocks everyone else.

User-agent: googlebot
Disallow:

User-agent: msnbot
Disallow:

User-agent: slurp
Disallow:

User-agent: *
Disallow: /

andrea99

11:25 pm on Aug 17, 2005 (gmt 0)



...and exclude the rest
The most annoying bots, the ones that spider for email addresses or who knows what, will just ignore your robots.txt.

The regular ones are best dealt with via mod_rewrite rules or IP denial in .htaccess, if you're on Apache.

Some mask their user-agents... Some just can't be stopped. Most hit once and never return, so banning IPs just swells your .htaccess file.
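A minimal .htaccess sketch of the user-agent approach (assuming Apache with mod_rewrite enabled; the user-agent strings here are just examples, not a vetted blocklist):

```apache
RewriteEngine On
# Return 403 Forbidden when the User-Agent matches either pattern
# (NC = case-insensitive match, OR = either condition suffices)
RewriteCond %{HTTP_USER_AGENT} EmailSiphon [NC,OR]
RewriteCond %{HTTP_USER_AGENT} WebZIP [NC]
RewriteRule .* - [F]
```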

There are lots of threads here on how to do this.

Good luck.

jcmiras

12:04 am on Aug 18, 2005 (gmt 0)

10+ Year Member



"The most annoying bots that spider for email addys or who knows what will just ignore your robots.txt."

So it's up to the bot whether it follows robots.txt or not?

Dijkgraaf

12:16 am on Aug 18, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Yes, it is totally up to the bot to obey robots.txt; the file itself doesn't enforce anything.

jcmiras

12:52 am on Aug 18, 2005 (gmt 0)

10+ Year Member



Wow, great. Considering the current behaviour of email spammers and web site scrapers, I think web server developers should find a way to strictly enforce robots.txt, just like .htaccess.

Dijkgraaf

1:26 am on Aug 18, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



There are ways of doing that. Search for "robot traps" and you will find various methods.
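As a sketch of the idea (the /trap/ path is made up for illustration): disallow a directory in robots.txt, then link to it invisibly. Well-behaved bots never request it, so anything that does has ignored robots.txt and can be banned.

```
# robots.txt -- compliant bots will never fetch this directory
User-agent: *
Disallow: /trap/
```

In the page itself, a hidden link such as <a href="/trap/"></a> springs the trap; a script answering at /trap/ then logs or bans the visitor's IP.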

I also make sure that bots can't actually find e-mail addresses on my web site.
For some methods see
[projecthoneypot.org...]
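One common method (a generic sketch, not necessarily one from the page linked above) is to encode each character of the address as an HTML entity. Browsers render it normally, but naive harvesters scanning raw HTML for user@host patterns tend to miss it:

```python
def obfuscate_email(addr):
    # Encode every character as a decimal HTML entity, e.g. '@' -> '&#64;'
    return "".join("&#%d;" % ord(c) for c in addr)

# A mailto: link built from this string still works in browsers
print(obfuscate_email("a@b"))  # -> &#97;&#64;&#98;
```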

Reid

5:03 am on Aug 19, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



robots.txt was never meant for access security.
It is simply a guide for good bots to save bandwidth and nothing more.
It is totally up to the bot whether it will obey or even look at robots.txt.
Fortunately, all of the large search engine bots DO obey robots.txt, and it IS very handy for saving bandwidth.
That's all it was ever meant to be.

jdMorgan

5:24 am on Aug 19, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Using .htaccess on Apache or similar approaches on MS servers, you can:

  • Block access by user-agent
  • Block by IP address or IP address range
  • Block by referrer
  • Block by Remote Host (very inefficient for the server, but available if necessary)
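Hedged .htaccess sketches of those four approaches (Apache-style syntax; every IP, hostname, and pattern below is a placeholder):

```apache
# Block by user-agent (needs mod_rewrite)
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} BadBot [NC]
RewriteRule .* - [F]

# Block by IP address or range
Order Allow,Deny
Allow from all
Deny from 192.0.2.1
Deny from 198.51.100.

# Block by referrer
RewriteCond %{HTTP_REFERER} spamsite\.example [NC]
RewriteRule .* - [F]

# Block by remote host (forces a reverse DNS lookup per request)
Deny from .crawler.example
```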

Using key_master's and xlcus' bad-bots scripts, you can block bad bots behaviourally.

key_master's script [webmasterworld.com] (Perl) uses invisible 'trap' links seeded into your pages which are disallowed by robots.txt. If a bad bot ignores robots.txt or doesn't fetch it, it follows those links. The result is that the script is activated, which adds the offending bot's IP address to a denied-IP list in .htaccess.

xlcus' script [webmasterworld.com] (PHP) blocks access based on the speed of requests. It's good for catching scrapers and site downloaders. Again, once the trap is sprung, the offender's IP address is added to a denied-IP list in .htaccess.
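The speed-based idea can be sketched like this (assumptions throughout: the SpeedTrap name and the thresholds are made up for illustration, and this is not the original PHP code):

```python
import time
from collections import deque

class SpeedTrap:
    """Flag an IP that makes more than `limit` requests
    inside a sliding `window` of seconds."""
    def __init__(self, limit=20, window=5.0):
        self.limit = limit
        self.window = window
        self.hits = {}  # ip -> deque of request timestamps

    def request(self, ip, now=None):
        now = time.time() if now is None else now
        q = self.hits.setdefault(ip, deque())
        q.append(now)
        while q and now - q[0] > self.window:
            q.popleft()  # forget requests that fell out of the window
        return len(q) > self.limit  # True => add this IP to the deny list

trap = SpeedTrap(limit=3, window=1.0)
results = [trap.request("192.0.2.7", now=t) for t in (0.0, 0.1, 0.2, 0.3)]
print(results)  # the fourth rapid request trips the trap: [False, False, False, True]
```

A real deployment would append the flagged IP to the denied-IP list in .htaccess, as the scripts above do.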

Once a month, you can go through and prune the list to keep your .htaccess file's size reasonable.

Note that the links above lead to threads with modified/enhanced versions of these scripts. Those threads contain links to the originals, and I have credited the original authors here.

Jim
