Robots.txt

Forum Moderators: open

Suggestions sought for "must exclude" agents

biggles

2:40 am on Dec 2, 2002 (gmt 0)

I'm thinking about extending my robots.txt file to exclude mail harvesting agents due to the amount of email spam I've been getting. I'll also take the opportunity to exclude content harvesters and other bandwidth-stealing agents.

I've looked at the WebmasterWorld robots.txt [webmasterworld.com ] for inspiration/guidance and I'm puzzled by some of the exclusions, such as WebmasterWorld Extractor.

Also, this file appears to differ from the comprehensive robots.txt file on the tutorial page, robots4.txt [searchengineworld.com ], which only allows known "nice guy" spiders. That one features some different agents, like BlackWidow.

Any suggestions, please, for a list of "must exclude" agents for a robots.txt file?

Thanks

jdMorgan

4:08 am on Dec 2, 2002 (gmt 0)

biggles,

Generally speaking, very few robots need to be Disallowed in your robots.txt, for the simple reason that most of the bad ones won't read or honor robots.txt.

As a result, different site owners will use differing criteria to determine which robots to exclude. But even the worst of the ones that do honor robots.txt are just a nuisance. They spider your site with a specific purpose - as stated by the owners - that has nothing to do with your site. So they're just wasting their time and your bandwidth.

To exclude the truly bad ones, you'll need to block them at the server level, by user-agent or by IP address range, rather than in robots.txt. There is much ongoing discussion of this in the Tracking and Logging and Search Engine Spider Identification forums. If you really want to absolutely avoid bad bots going through your site, you'll be disappointed - the best you can really do is block the known ones and try to trap the unknown ones. But it's never 100%.
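As a rough illustration of blocking by user-agent or IP address range, here is a minimal Python sketch. It is not taken from any particular server setup: the function name `is_blocked`, the agent substrings, and the address range are all hypothetical placeholders (BlackWidow is simply the agent mentioned earlier in this thread, and 192.0.2.0/24 is a reserved documentation range). In practice the same check would more likely live in your web server's own access rules.

```python
import ipaddress

# Hypothetical examples only: substitute the agents and ranges from your own logs.
BLOCKED_AGENT_SUBSTRINGS = ("blackwidow", "exampleharvester")
BLOCKED_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]


def is_blocked(user_agent: str, remote_ip: str) -> bool:
    """Return True if the request matches a blocked agent or IP range."""
    ua = user_agent.lower()
    # Case-insensitive substring match on the User-Agent header.
    if any(fragment in ua for fragment in BLOCKED_AGENT_SUBSTRINGS):
        return True
    # Membership test against each blocked network.
    addr = ipaddress.ip_address(remote_ip)
    return any(addr in network for network in BLOCKED_NETWORKS)
```

Bear in mind that a user-agent string is trivially faked, so a check like this only catches the bots that announce themselves; the IP range test is the part that still works against copy-cats, and neither is ever 100%.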

To protect your e-mail address, I suggest forms-based e-mail, with the template files inaccessible via HTTP. If you put your e-mail address on your site in plain text, it'll be harvested immediately. If you "obscure" it using JavaScript, it may be safe for a while, and certainly safe from most harvesters. The problem is that a JavaScript-savvy harvester (or even a real human!) will come through some day, and your e-mail address will be duly noted and then sold to all bidders. Just my opinion, but if you're in your site for the long term, use forms-based mail and do it before you publish your site. And use a different address for all other correspondence (personal e-mails, jokes, etc.). These addresses get forwarded endlessly by unknowing friends, and due to the "seven degrees of separation" rule, you can bet that a friend of a friend (seven times) is an e-mail spammer.

Some of the following bots will obey robots.txt. However, there may also be copy-cats that use the same user-agent but ignore these disallows; they need to be blocked by other methods.

Here's my current list, just for starters:

User-agent: BlogBot
Disallow: /

User-agent: psbot
Disallow: /

User-agent: rabaz
Disallow: /

User-agent: rico
Disallow: /

User-agent: RPT-HTTPClient
Disallow: /

User-agent: TurnitinBot
Disallow: /

User-agent: Lachesis
Disallow: /

User-agent: ScoutAbout
Disallow: /
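If you want to sanity-check a list like this before uploading it, one option is Python's standard `urllib.robotparser` module. The sketch below is only an illustration: it uses two entries from the list above, and the helper name `allowed` and the example.com URL are placeholders. It shows how a well-behaved parser would read the rules; it says nothing about bots that ignore robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Two entries copied from the list above, for illustration.
ROBOTS_TXT = """\
User-agent: psbot
Disallow: /

User-agent: TurnitinBot
Disallow: /
"""


def allowed(agent: str, url: str = "http://www.example.com/index.html") -> bool:
    """Return True if the rules permit `agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)


print(allowed("psbot"))      # a listed agent
print(allowed("Googlebot"))  # an agent with no matching record
```

Agents without a matching record (and with no `User-agent: *` record present) are allowed by default, which is why a list like this only shuts out the bots it names.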

HTH,
Jim

biggles

5:06 am on Dec 2, 2002 (gmt 0)

Thanks Jim, very helpful - especially the email advice.

I must be losing it - I just realised I started a similar thread at [webmasterworld.com ], and the replies there made it clear that more robust exclusion, rather than robots.txt, is needed to keep out the bad boys.
