Forum Moderators: goodroi

allow some and exclude the rest in robots.txt

         

jcmiras

7:11 pm on Aug 17, 2005 (gmt 0)

10+ Year Member



Is there a way to allow some bots (e.g. googlebot, msnbot, and slurp) to spider my website and exclude all the rest? Please tell me how.

Dijkgraaf

11:15 pm on Aug 17, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Yes, something like the following should work. An empty Disallow line means "allow everything" for that bot, and the final wildcard record blocks everyone else.

User-agent: googlebot
Disallow:

User-agent: msnbot
Disallow:

User-agent: slurp
Disallow:

User-agent: *
Disallow: /

andrea99

11:25 pm on Aug 17, 2005 (gmt 0)



...and exclude the rest
The most annoying bots, the ones that spider for email addresses or who knows what, will just ignore your robots.txt.

The regular ones are best dealt with via mod_rewrite rules or IP denial in .htaccess, if you're on Apache.

Some mask their user-agents... Some just can't be stopped. Most hit once and never return, so banning IPs just swells your .htaccess file.
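A minimal .htaccess sketch of the user-agent approach (assuming Apache with mod_rewrite enabled; the user-agent strings here are just examples, not a vetted blocklist):

```apache
RewriteEngine On
# Return 403 Forbidden when the User-Agent matches either pattern
# (NC = case-insensitive match, OR = either condition suffices)
RewriteCond %{HTTP_USER_AGENT} EmailSiphon [NC,OR]
RewriteCond %{HTTP_USER_AGENT} WebZIP [NC]
RewriteRule .* - [F]
```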

There are lots of threads here on how to do this.

Good luck.

jcmiras

12:04 am on Aug 18, 2005 (gmt 0)

10+ Year Member



"The most annoying bots that spider for email addys or who knows what will just ignore your robots.txt."

So it's up to the bot whether it follows robots.txt or not?

Dijkgraaf

12:16 am on Aug 18, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Yes, it is totally up to the bot to obey robots.txt; the file itself doesn't enforce anything.

jcmiras

12:52 am on Aug 18, 2005 (gmt 0)

10+ Year Member



Wow, great. Considering the current behaviour of email spammers and web site scrapers, I think web server developers should find a way to strictly enforce robots.txt, just like .htaccess.

Dijkgraaf

1:26 am on Aug 18, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



There are ways of doing that. Search for "robot traps" and you will find various methods.
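As a sketch of the idea (the /trap/ path is made up for illustration): disallow a directory in robots.txt, then link to it invisibly. Well-behaved bots never request it, so anything that does has ignored robots.txt and can be banned.

```
# robots.txt -- compliant bots will never fetch this directory
User-agent: *
Disallow: /trap/
```

In the page itself, a hidden link such as <a href="/trap/"></a> springs the trap; a script answering at /trap/ then logs or bans the visitor's IP.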

I also make sure that bots can't actually find e-mail addresses on my web site.
For some methods see
[projecthoneypot.org...]
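One common method (a generic sketch, not necessarily one from the page linked above) is to encode each character of the address as an HTML entity. Browsers render it normally, but naive harvesters scanning raw HTML for user@host patterns tend to miss it:

```python
def obfuscate_email(addr):
    # Encode every character as a decimal HTML entity, e.g. '@' -> '&#64;'
    return "".join("&#%d;" % ord(c) for c in addr)

# A mailto: link built from this string still works in browsers
print(obfuscate_email("a@b"))  # -> &#97;&#64;&#98;
```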

Reid

5:03 am on Aug 19, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



robots.txt was never meant for access security.
It is simply a guide for good bots to save bandwidth and nothing more.
It is totally up to the bot whether it will obey or even look at robots.txt.
Fortunately, all of the large search engine bots DO obey robots.txt, and it IS very handy for saving bandwidth.
That's all it was ever meant to be.

jdMorgan

5:24 am on Aug 19, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Using .htaccess on Apache or similar approaches on MS servers, you can:

  • Block access by user-agent
  • Block by IP address or IP address range
  • Block by referrer
  • Block by Remote Host (very inefficient for the server, but available if necessary)
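Hedged .htaccess sketches of those four approaches (Apache-style syntax; every IP, hostname, and pattern below is a placeholder):

```apache
# Block by user-agent (needs mod_rewrite)
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} BadBot [NC]
RewriteRule .* - [F]

# Block by IP address or range
Order Allow,Deny
Allow from all
Deny from 192.0.2.1
Deny from 198.51.100.

# Block by referrer
RewriteCond %{HTTP_REFERER} spamsite\.example [NC]
RewriteRule .* - [F]

# Block by remote host (forces a reverse DNS lookup per request)
Deny from .crawler.example
```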

Using key_master's and xlcus' bad-bots scripts, you can block bad bots behaviourally.

key_master's script [webmasterworld.com] (Perl) uses invisible 'trap' links seeded into your pages which are disallowed by robots.txt. If a bad bot ignores robots.txt or doesn't fetch it, it follows those links. The result is that the script is activated, which adds the offending bot's IP address to a denied-IP list in .htaccess.

xlcus' script [webmasterworld.com] (PHP) blocks access based on the speed of requests. It's good for catching scrapers and site downloaders. Again, once the trap is sprung, the offender's IP address is added to a denied-IP list in .htaccess.
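The speed-based idea can be sketched like this (assumptions throughout: the SpeedTrap name and the thresholds are made up for illustration, and this is not the original PHP code):

```python
import time
from collections import deque

class SpeedTrap:
    """Flag an IP that makes more than `limit` requests
    inside a sliding `window` of seconds."""
    def __init__(self, limit=20, window=5.0):
        self.limit = limit
        self.window = window
        self.hits = {}  # ip -> deque of request timestamps

    def request(self, ip, now=None):
        now = time.time() if now is None else now
        q = self.hits.setdefault(ip, deque())
        q.append(now)
        while q and now - q[0] > self.window:
            q.popleft()  # forget requests that fell out of the window
        return len(q) > self.limit  # True => add this IP to the deny list

trap = SpeedTrap(limit=3, window=1.0)
results = [trap.request("192.0.2.7", now=t) for t in (0.0, 0.1, 0.2, 0.3)]
print(results)  # the fourth rapid request trips the trap: [False, False, False, True]
```

A real deployment would append the flagged IP to the denied-IP list in .htaccess, as the scripts above do.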

Once a month, you can go through and prune the list to keep your .htaccess file's size reasonable.

Note that the links above lead to threads with modified/enhanced versions of these scripts. Those threads contain links to the originals, and I have credited the original authors here.

Jim
