Home / Forums Index / Advertising / Pay Per Click Engines

Pay Per Click Engines Forum

    
Robots.txt
Suggestions sought for "must exclude" agents
biggles




msg:1231132
 2:40 am on Dec 2, 2002 (gmt 0)

I'm thinking about extending my robots.txt file to exclude mail harvesting agents due to the amount of email spam I've been getting. I'll also take the opportunity to exclude content harvesters and other bandwidth-stealing agents.

I've looked at the WebmasterWorld robots.txt [webmasterworld.com ] for inspiration/guidance and I'm puzzled by some of the exclusions, such as WebmasterWorld Extractor.

Also, this file appears to differ from the comprehensive robots.txt file on the tutorial page - robots4.txt [searchengineworld.com ] - which only allows known "nice guy" spiders. That one features some different agents like BlackWidow.

Any suggestions, please, for a list of "must exclude" agents for a robots.txt file?

Thanks

 

jdMorgan




msg:1231133
 4:08 am on Dec 2, 2002 (gmt 0)

biggles,

Generally speaking, very few robots need to be Disallowed in your robots.txt, for the simple reason that most of the bad ones won't read or honor robots.txt.

As a result, different site owners will use differing criteria to determine which robots to exclude. But even the worst of the ones that do honor robots.txt are just a nuisance. They spider your site with a specific purpose - as stated by the owners - that has nothing to do with your site. So they're just wasting their time and your bandwidth.

To exclude the truly bad ones, you'll need to exclude them by user-agent or by IP address range. There is much on-going discussion of this in the Tracking and Logging and Search Engine Spider Identification forums. If you really want to absolutely avoid bad bots going through your site, you'll be disappointed - the best you can really do is block the known ones and try to trap the unknown ones. But it's never 100%.
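As a rough illustration of blocking by user-agent or IP address range, here is a minimal Python sketch. The function name `is_blocked`, the `examplebadbot` agent string, and the address ranges (reserved documentation ranges) are all made up for the example; a real list would come from your own logs, and in practice this kind of check is often done in the server configuration instead.

```python
import ipaddress

# Hypothetical deny lists for illustration only; build real ones from your logs.
BLOCKED_AGENT_SUBSTRINGS = ("blackwidow", "examplebadbot")
BLOCKED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),     # documentation range, stands in for a bad netblock
    ipaddress.ip_network("198.51.100.0/24"),  # documentation range, stands in for a bad netblock
]


def is_blocked(user_agent: str, remote_addr: str) -> bool:
    """Return True if the request matches a blocked agent string or IP range."""
    agent = (user_agent or "").lower()
    # Case-insensitive substring match on the User-Agent header.
    if any(fragment in agent for fragment in BLOCKED_AGENT_SUBSTRINGS):
        return True
    try:
        address = ipaddress.ip_address(remote_addr)
    except ValueError:
        return False  # unparseable address: leave it to other checks
    return any(address in network for network in BLOCKED_NETWORKS)
```

A copy-cat that fakes a browser user-agent slips past the first check, which is why the IP range check (and some form of trap for unknown bots) is still worth having.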

To protect your e-mail address, I suggest forms-based e-mail, with the template files inaccessible via HTTP. If you put your e-mail address on your site in plain text, it'll be harvested immediately. If you "obscure" it using JavaScript, it may be safe for a while, and certainly safe from most harvesters. The problem is that a JavaScript-savvy harvester (or even a real human!) will come through some day, and your e-mail address will be duly noted and then sold to all bidders. Just my opinion, but if you're in your site for the long term, use forms-based mail and do it before you publish your site. And use a different address for all other correspondence (personal e-mails, jokes, etc.) - these addresses get forwarded endlessly by unknowing friends, and due to the "seven degrees of separation" rule, you can bet that a friend of a friend (seven times) is an e-mail spammer.
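To make the forms-based idea concrete, here is a small Python sketch of the server-side half. It assumes a form that submits a name, a reply address, and a message; `build_contact_message` and the `webmaster@example.com` address are hypothetical. The point is only that the real address lives in server-side code or config and never appears in any page a harvester can fetch.

```python
from email.message import EmailMessage

# Assumption: the recipient is stored only on the server, never in served HTML.
RECIPIENT = "webmaster@example.com"  # hypothetical address


def build_contact_message(sender_name: str, sender_email: str, body: str) -> EmailMessage:
    """Turn submitted form fields into a message addressed on the server side."""

    def one_line(value: str) -> str:
        # Collapse all whitespace (including CR/LF) so a visitor cannot
        # smuggle extra header lines into the message.
        return " ".join(value.split())

    message = EmailMessage()
    message["To"] = RECIPIENT
    message["From"] = RECIPIENT
    message["Reply-To"] = one_line(sender_email)
    message["Subject"] = "Site contact from " + one_line(sender_name)
    message.set_content(body)
    return message
```

Handing the finished message to a mail server (for example with `smtplib`) is left out here, since that part depends on the hosting setup.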

Some of the following bots will obey robots.txt. However, there may also be copy-cats that use the same user-agent and ignore these disallows; they need to be blocked with other methods.

Here's my current list, just for starters:

User-agent: BlogBot
Disallow: /

User-agent: psbot
Disallow: /

User-agent: rabaz
Disallow: /

User-agent: rico
Disallow: /

User-agent: RPT-HTTPClient
Disallow: /

User-agent: TurnitinBot
Disallow: /

User-agent: Lachesis
Disallow: /

User-agent: ScoutAbout
Disallow: /

HTH,
Jim

biggles




msg:1231134
 5:06 am on Dec 2, 2002 (gmt 0)

Thanks Jim, very helpful - especially the email advice.

I must be losing it - I just realised I started a similar thread at [webmasterworld.com ] & the replies there made it clear that something more robust than robots.txt is needed to keep out the bad boys.


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved