Msg#: 3289125 posted 12:32 am on Mar 22, 2007 (gmt 0)
I'm writing this in response to a few comments from people telling me I'm mistaken, that robots.txt is an exclusion-only method, even though I use it successfully for OPT-IN to my site.
Many people mistakenly think of robots.txt as an exclusion-only method because of its initial implementation, but it can easily be adapted to a complete opt-in methodology. The easiest and most secure way to accomplish this, without tipping your hand about which user agents you permit, is to use a dynamic robots.txt file, just like WebmasterWorld does.
Note that the script INCLUDES 4 search engines and excludes everyone else.
If you didn't have any paths that you wanted to block, you could still serve the allowed search engines a placebo robots.txt that looks like this:
User-agent: *
Disallow: /placebo/
Obviously that's a hack around exclusion implementations. So what? It works, sue me.
And the rest of the world trying to snoop your robots.txt file would see this:
User-agent: *
Disallow: /
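Here's a rough sketch of how the dynamic part could pick between those two responses. It's in Python rather than Perl, and it is not the WebmasterWorld script. The allow-list tokens and function names are made up for illustration (the real list would be whichever search engines you opt in), and matching on the User-Agent string alone is an assumption of the sketch.

```python
# Hypothetical allow-list: lowercase substrings to look for in the
# User-Agent header. These names are placeholders, not real crawlers.
ALLOWED_AGENT_TOKENS = ("examplebot", "samplecrawler")

# Served to opted-in bots: a harmless rule so the file is still valid.
PLACEBO_RULES = "User-agent: *\nDisallow: /placebo/\n"

# Served to everyone else: deny the whole site.
DENY_ALL_RULES = "User-agent: *\nDisallow: /\n"


def is_allowed_agent(user_agent):
    """Return True if the User-Agent contains an allow-listed token."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in ALLOWED_AGENT_TOKENS)


def robots_txt_for(user_agent):
    """Return the robots.txt body to serve for this User-Agent."""
    if is_allowed_agent(user_agent):
        return PLACEBO_RULES
    return DENY_ALL_RULES
```

Whatever handles requests for /robots.txt would call robots_txt_for() with the request's User-Agent header and send the result back as text/plain. A missing or empty User-Agent falls through to the deny-all response.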
Now, to enforce those rules, it would be a simple matter for that Perl script to record the IP of every request to robots.txt that was told DENY (Disallow: /), and then have another script that monitors page access in real time return 403 errors to those IPs if they attempt to access any other pages on your web site.
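The enforcement side could look something like this, again as a Python sketch rather than the actual Perl. The in-memory set and the function names are assumptions for illustration. Two separate scripts would need to share the list through a file or database instead.

```python
# IPs that requested robots.txt and were served the deny-all rules.
# A plain in-memory set is an assumption of this sketch; separate
# scripts would have to share this state some other way.
denied_ips = set()


def record_robots_response(ip, was_denied):
    """Called by the robots.txt handler after it answers a request."""
    if was_denied:
        denied_ips.add(ip)


def status_for_page_request(ip):
    """Called by the page monitor: 403 for denied IPs, 200 otherwise."""
    return 403 if ip in denied_ips else 200
```

An IP that never fetched robots.txt is not on the list, so this only catches bots that asked, were told no, and kept crawling anyway.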
So with a couple of simple scripts you're doing 100% opt-in for bots pretending to play by the rules using an exclusion protocol, and blocking those that refuse to take NO for an answer from getting any content.