Forum Moderators: goodroi
And it all reminds me of browser code wars -- a la Javascript v. Jscript -- just with a LOT more players.
Anyone else getting tired of keeping up with each SE's peculiarities?
Heck, if I don't want 'em, I'd rather 403 'em. Quick and painless. (Now don't start, you Standards-and-Protocols folks. We've been around that track before:)
A 5,000-character limit will require me to serve individual robots.txt for Bad Guys (nothing shows but the minimum generic stuff).
That way you don't have to keep track of all of the 'other' or 'bad' bots out there, just the good bots.
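The whitelist idea above can be sketched in a few lines of server-side code. This is only an illustration, not anyone's actual setup: the bot names, the two robots.txt bodies, and the function name are all assumptions you would replace with your own.

```python
# Sketch of the whitelist approach: serve the full robots.txt only to
# recognized 'good' bots, and a bare-bones deny-all version to everything
# else. Bot names and rules below are illustrative assumptions.

GOOD_BOTS = ("googlebot", "slurp", "bingbot")  # hypothetical whitelist

FULL_ROBOTS = """User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
"""

MINIMAL_ROBOTS = """User-agent: *
Disallow: /
"""

def robots_txt_for(user_agent: str) -> str:
    """Return the robots.txt body to serve for this User-Agent header."""
    ua = (user_agent or "").lower()
    if any(bot in ua for bot in GOOD_BOTS):
        return FULL_ROBOTS
    return MINIMAL_ROBOTS
```

Hook a function like this up to requests for /robots.txt (via a script or mod_rewrite rule), and unrecognized agents never see your full rule set.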
"if ($agent" part -- [webmasterworld.com...]
(I think that's all of the parts:)
Actually, serving up two different versions -- one for allowed bots and one for all the rest -- has been on my to-do list since last week and I've been meaning to look up some posts by Jim Morgan about how-to when it comes to controlling everything with mod_rewrite. I just hadn't planned on making, updating and serving up a third version just for Google (allowed). Oh, well!
This quote sets off a red flag for me: A single robot will obey one record in your robots.txt -- either the first one it finds where the User-agent: token matches its User-agent name, or the one that matches its name most exactly. So, technically, there are no such things as 'rules for robots' and 'rules for pages' -- the control of pages should be part and parcel of the robot-control records.
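You can see the one-record-per-robot behavior with Python's stdlib parser. (Note: urllib.robotparser uses first-match record selection, while real crawlers like Googlebot pick the most specific match; with distinct User-agent names as below, both strategies give the same answer. The bot names and paths are made up for illustration.)

```python
# A robot obeys exactly one record: the one whose User-agent token
# matches it, or else the wild-card (*) record.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Googlebot matches its own record, so only /private/ is off-limits:
print(rp.can_fetch("Googlebot", "/page.html"))       # True
print(rp.can_fetch("Googlebot", "/private/x.html"))  # False
# Any other robot falls through to the wild-card record (Disallow: /):
print(rp.can_fetch("SomeOtherBot", "/page.html"))    # False
```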
The Google robots.txt checker tool may have a 5000-character limit, but that does not necessarily apply to the Googlebot itself. Just FYI, I've got sites with robots.txt files ranging from 3kB to 16kB, and *none* have any problems. I would strongly suggest looking at other aspects of this problem instead of filesize.
But a large file is somewhat wasteful of bandwidth over time. To reduce the filesize, go through your robots.txt and remove all records pertaining to robots that you do not want on your site, as well as those pertaining to robots known to ignore robots.txt.
Then add a wild-card record at the bottom to Disallow any robot whose User-agent token does not appear in the preceding records. After you've done so, any 'recognized' and 'good' robot will find the record that applies to it and obey that one. Any robot you don't care about will not find a record specific to it, and so will obey the last record -- the wild-card record added above -- and recognize that it should not spider your site. And bad bots don't heed (or even read) robots.txt at all, so why list them?
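Put together, a trimmed robots.txt along those lines might look like this (the bot names and Disallow paths are placeholders; substitute the robots you actually want and your own directories):

```
# One record per 'good' robot you want on the site:
User-agent: Googlebot
Disallow: /cgi-bin/

User-agent: Slurp
Disallow: /cgi-bin/

# Wild-card record: any robot not matched above obeys this one.
User-agent: *
Disallow: /
```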
If you are strong on server-side scripting, then I'll second physics' recommendation above. But test thoroughly!
Jim