Forum Moderators: goodroi

How to use Robots.txt as an INCLUSION method

It's exclusion only if you use it that way

12:32 am on Mar 22, 2007 (gmt 0)

incredibill, WebmasterWorld Administrator from US

joined: Jan 25, 2005
votes: 99

I'm writing this in response to comments from several people who insist I'm mistaken because robots.txt is an exclusion-only method, even though I use it successfully as an OPT-IN mechanism for my site.

Many people mistakenly think of robots.txt as an exclusion-only method because of its initial implementation, but it can easily be adapted to a complete opt-in methodology. The easiest and most secure way to accomplish this, without tipping your hand about which user agents you permit, is to use a dynamic robots.txt file, just like WebmasterWorld does.

See the simple Perl script Brett uses here:

Note that the script INCLUDES 4 search engines and excludes everyone else.
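The original is a Perl CGI, which isn't reproduced here; this is a hypothetical Python sketch of the same idea. The bot names, the `/placebo/` path, and the function name are illustrative assumptions, not the actual WebmasterWorld list:

```python
# Sketch of a dynamic robots.txt: include a short list of search engine
# user agents, exclude everyone else. Bot substrings below are examples.
ALLOWED_BOTS = ("googlebot", "slurp", "msnbot", "teoma")  # 4 included engines (illustrative)

ALLOW_RESPONSE = "User-agent: *\nDisallow: /placebo/\n"   # placebo block for permitted bots
DENY_RESPONSE = "User-agent: *\nDisallow: /\n"            # full-site deny for everyone else

def robots_txt(user_agent: str) -> str:
    """Return the robots.txt body appropriate for this user agent."""
    ua = user_agent.lower()
    if any(bot in ua for bot in ALLOWED_BOTS):
        return ALLOW_RESPONSE
    return DENY_RESPONSE

print(robots_txt("Mozilla/5.0 (compatible; Googlebot/2.1)"))  # permitted: placebo rules
print(robots_txt("SomeRandomScraper/1.0"))                    # everyone else: Disallow: /
```

In practice this would run as the handler behind the /robots.txt URL (via a rewrite rule or CGI), so snoopers never see the real allow list.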

If you didn't have any paths you wanted to block, you could still serve allowed search engines a placebo robots.txt that looks like this:

User-agent: *
Disallow: /placebo/

Obviously that's a hack around exclusion implementations, but so what? It works; sue me.

And the rest of the world trying to snoop your robots.txt file would see this:

User-agent: *
Disallow: /

Now, to enforce those rules, it would be a simple matter for that Perl script to record the IP of every robots.txt request that was answered with a deny (Disallow: /), and then have another script monitoring real-time page access return 403 errors if those IPs attempted to access any other pages on your web site.
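A minimal sketch of that enforcement step, again in Python rather than the original Perl. The in-memory set and function names are assumptions; a real site would persist the denied IPs somewhere both scripts can read:

```python
# Hypothetical enforcement: remember every IP that was handed the full
# "Disallow: /" response, then 403 any later page request from those IPs.
denied_ips = set()

def serve_robots_txt(ip: str, allowed: bool) -> str:
    """Answer a robots.txt request, recording IPs that were told to stay out."""
    if allowed:
        return "User-agent: *\nDisallow: /placebo/\n"
    denied_ips.add(ip)  # this IP was told "Disallow: /"
    return "User-agent: *\nDisallow: /\n"

def page_status(ip: str) -> int:
    """HTTP status for a page request: 403 if the IP was already denied."""
    return 403 if ip in denied_ips else 200

serve_robots_txt("203.0.113.7", allowed=False)
print(page_status("203.0.113.7"))   # denied in robots.txt but came back anyway: 403
print(page_status("198.51.100.9"))  # never denied: 200
```

The key design point is that robots.txt itself becomes the tripwire: any client that fetches it, gets a full deny, and keeps crawling has identified itself as a bot that ignores the rules.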

So with a couple of simple scripts you're running 100% opt-in on top of an exclusion protocol: bots that play by the rules get in, and those that refuse to take NO for an answer are blocked from getting any content.

It doesn't get any easier now, does it?