

Sitemaps, Meta Data, and robots.txt Forum

    
How to use Robots.txt as an INCLUSION method
It's exclusion only if you use it that way
incrediBILL




msg:3289127
 12:32 am on Mar 22, 2007 (gmt 0)

I'm writing this in response to a few comments from people telling me I'm mistaken, that robots.txt is an exclusion-only method, even though I use it successfully for OPT-IN on my site.

Many people mistakenly think of robots.txt as an exclusion-only method because of its initial implementation, but it can easily be adapted to a complete opt-in methodology. The easiest and most secure way to accomplish this, without tipping your hand about which user agents you permit, is to use a dynamic robots.txt file, just like WebmasterWorld does.

See the simple Perl script Brett uses here:
[webmasterworld.com...]

Note that the script INCLUDES 4 search engines and excludes everyone else.

If you didn't have any paths that you wanted to block, you could still serve allowed search engines a placebo robots.txt that looks like this:

User-agent: *
Disallow: /placebo/

Obviously that's a hack around exclusion implementations. So what? It works, sue me.

And the rest of the world trying to snoop your robots.txt file would see this:

User-agent: *
Disallow: /
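To make the idea concrete, here is a minimal Python sketch of the decision a dynamic robots.txt makes. It is not Brett's Perl script; the names ALLOWED_AGENTS, is_allowed and robots_txt_for are made up for illustration, and the two tokens in the allow-list are placeholders, not the four engines his script includes.

```python
# Hypothetical allow-list: lowercase substrings of the User-Agent headers
# you want to opt in. Placeholders only; swap in your own list.
ALLOWED_AGENTS = ("googlebot", "bingbot")

# The two robots.txt bodies shown in the post.
PLACEBO_RULES = "User-agent: *\nDisallow: /placebo/\n"
DENY_ALL_RULES = "User-agent: *\nDisallow: /\n"


def is_allowed(user_agent):
    """Return True if the User-Agent header contains an allowed token."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in ALLOWED_AGENTS)


def robots_txt_for(user_agent):
    """Pick the robots.txt body to serve to this requester."""
    return PLACEBO_RULES if is_allowed(user_agent) else DENY_ALL_RULES
```

Matching on the User-Agent string alone is easy to spoof, so treat this as the opt-in announcement only, not the enforcement.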

Now, to enforce those rules, it would be a simple matter for that Perl script to record the IP of every robots.txt request that was told DENY (Disallow: /), and then have another script that monitors page requests in real time return 403 errors to those IPs if they attempt to access any other page on your web site.
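The enforcement step might look something like the Python sketch below. The class and method names are hypothetical, and it assumes a single process with an in-memory set; a real setup with separate scripts would need a shared store such as a file or database.

```python
class DeniedIpRegistry:
    """In-memory record of IPs that were served the deny-all robots.txt."""

    def __init__(self):
        self._denied = set()

    def record_robots_request(self, ip, allowed):
        """Call from the robots.txt handler once allow/deny is decided."""
        if not allowed:
            # Once denied, an IP stays denied even if it later retries
            # with a different User-Agent.
            self._denied.add(ip)

    def status_for_page(self, ip):
        """Call from the page handler: 403 for denied IPs, 200 otherwise."""
        return 403 if ip in self._denied else 200


# Example flow, using documentation-range IP addresses.
registry = DeniedIpRegistry()
registry.record_robots_request("203.0.113.7", allowed=False)   # told Disallow: /
registry.record_robots_request("198.51.100.9", allowed=True)   # got the placebo
```

Visitors that never request robots.txt, such as ordinary browsers, are never recorded and so are not affected by this check.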

So with a couple of simple scripts you're doing 100% opt-in on top of an exclusion protocol: bots pretending to play by the rules get told who is welcome, and those that refuse to take NO for an answer are blocked from getting any content.

It doesn't get any easier, now does it?

 
