After thinking this over, I've decided it would be best to completely block ALL bots from accessing my site. The only exception would be the home page (root), and if possible, I'd rather grant this exception only to search engines. So basically I want all bots to come to the home page, index it if they need to, then just go away.
The only problem is that my site is based on dynamic subdirectories (or subdomains, I'm still deciding which). In other words, there is a list of directories that I plan to expand over time as the website grows. It would be very inefficient to add every one of them to robots.txt.
Basically, from the home/root page (www.example.com) there are links to
www.example.com/dir1/ (OR) dir1.example.com (still haven't decided which one to use)
www.example.com/dir2/ (OR) dir2.example.com
...and so on. Each directory has its own (standard) sub-directories.
1. How can I add this exception (or rule) to robots.txt without actually specifying the whole list? I'm thinking of something like:
..repeat same thing for other search engines..
2. A greater concern for me is bots that ignore robots.txt, would implementing a trap on the rest of the pages be enough? Any suggestions?
Prefix every one of the sub-directory names with "_" (an underscore).
Worked for me to keep several message boards out of Google and the rest...
See, that's what I'm trying to avoid: having to list all the directories. I want to do it globally.
Basically, Allow the root directory, Disallow everything else.
Btw, I probably won't use subdomains, only sub-directories. The subdomain part will always be empty.
Maybe I wasn't clear -- a skill I'm still working on.
Re-jigger your site's directory structure so that all your sub-directory names (or, at least all of'em that you want to cover by this 'trick') are prefixed by a "_".
e.g., doc root:
Then all you'll need in robots.txt would be:
The pattern "/_" matches all the carefully contrived directory names you've set up.
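To see the prefix matching in action, here is a small sketch using Python's standard `urllib.robotparser`. The two-line rule set and the `_dir1` / `_dir2` names are assumptions pieced together from the description above, so treat it as an illustration rather than the exact file.

```python
from urllib.robotparser import RobotFileParser

# Assumed rule set: one Disallow line covering every "_"-prefixed directory.
ROBOTS_TXT = """\
User-agent: *
Disallow: /_
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(path):
    """Return True if a generic crawler may fetch the given path."""
    return parser.can_fetch("*", "http://www.example.com" + path)


print(allowed("/"))                 # home page: not covered by "/_"
print(allowed("/_dir1/"))           # starts with "/_": blocked
print(allowed("/_dir2/page.html"))  # anything below a "_" directory: blocked
```

New directories are covered automatically as long as their names start with "_", so robots.txt never has to change as the site grows.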
Obviously you'll need to correct all your href's, as well.
Maybe a simple `sed` process...
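As an alternative to `sed`, the same href rewrite could be sketched in Python. This assumes links are root-relative and double-quoted (`href="/dir1/..."`); `add_underscore_prefix` is a hypothetical helper name, and real pages may need a proper HTML parser instead of a regex.

```python
import re

# Match href="/name/..." where the first path segment is a directory
# that does not already start with "_".
HREF_DIR = re.compile(r'href="/(?!_)([^"/]+)/')


def add_underscore_prefix(html):
    """Prefix the first directory of every root-relative href with '_'."""
    return HREF_DIR.sub(r'href="/_\1/', html)


page = '<a href="/">Home</a> <a href="/dir1/page.html">Dir 1</a>'
print(add_underscore_prefix(page))
```

Links to the home page and links that already carry the prefix are left alone, so running it twice over the same file does no harm.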
Once you get past the robots.txt exercises, it'll be time to visit .htaccess -- several orders of magnitude more complicated than robots.txt. Because, you'll find that the rogue robots will not abide by the rules in your robots.txt. (Search WebmasterWorld for "Yanga WorldSearch Bot".)
Ok, I see, Thank you for the clarification. I assumed I'd have to add all _dirs into robots.txt.
I'll think this strategy over and see how I can best implement it. Common search engines also support '*' for patterns, so I was hoping to use something like 'Disallow: /*', which is technically the same idea as yours, except more general. I'll have to research how effective it is compared to using a specific character (i.e. _).
As for bad bots, I know, I'll probably implement a 'trap', and get rid of any bots that come through.
Just an update: after testing with Google Webmaster Tools, the following version successfully blocks everything except the home page:
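For anyone wanting to reason about why a rule set like this works, here is a rough sketch of how wildcard-aware crawlers evaluate it. The rules `Allow: /$` and `Disallow: /` are an assumption (one common way to express "root only"), and `is_allowed` is a hypothetical helper that mimics longest-match precedence with `*` and `$` support. It is not any crawler's actual code, and crawlers that ignore `Allow` or `$` will behave differently.

```python
import re


def rule_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches any run of characters; a trailing '$' anchors the end.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_allowed(path, rules):
    """Longest matching pattern wins; Allow wins a tie; no match means allowed."""
    best = None
    for directive, pattern in rules:
        if rule_to_regex(pattern).match(path):
            key = (len(pattern), directive == "allow")
            if best is None or key > best:
                best = key
    return True if best is None else best[1]


# Assumed rule set: allow exactly "/", disallow everything else.
RULES = [("allow", "/$"), ("disallow", "/")]

print(is_allowed("/", RULES))        # only the exact root matches "/$"
print(is_allowed("/dir1/", RULES))   # falls through to "Disallow: /"
```

Under this matching model `Disallow: /*` and `Disallow: /` cover the same set of paths, since a plain rule is already a prefix match.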
On a different note, I'm thinking about the bot-trap I'm going to implement. I can't help but be tempted to create an 'infinite loop' as the trap. Right now, all 'blocked' pages have meta tags (in addition to robots.txt):
<meta name="ROBOTS" content="NOINDEX, NOFOLLOW" />
so the way I see it, if any bots decide to ignore these, then they deserve no better than to go for a ride. Basically a single PHP file that generates a random URL, with all such URLs re-mapped back to this same file. So the bot is welcome to click away...
Obviously the other solution is to block the IP of anyone stumbling on this file/directory..
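Both ideas can be combined in one handler. The sketch below is in Python rather than the PHP mentioned above, and everything in it is hypothetical: the `/trap/` URL prefix, the in-memory `blocked_ips` set, and the function names. A real setup would also need the web server to map every `/trap/...` URL to this handler and to persist the blocklist (for example in .htaccess or a firewall rule).

```python
import secrets

# Hypothetical in-memory blocklist; a real site would persist this.
blocked_ips = set()


def trap_response(client_ip):
    """Record the visitor's IP and return a page linking to another random trap URL."""
    blocked_ips.add(client_ip)
    token = secrets.token_hex(8)  # fresh random path segment on every hit
    return (
        '<html><head><meta name="ROBOTS" content="NOINDEX, NOFOLLOW" /></head>'
        f'<body><a href="/trap/{token}/">more</a></body></html>'
    )


def is_blocked(client_ip):
    """Return True if this IP has ever requested a trap URL."""
    return client_ip in blocked_ips


page = trap_response("203.0.113.7")
print(page)
print(is_blocked("203.0.113.7"))
```

Since well-behaved crawlers never reach the trap (it is disallowed in robots.txt and marked NOFOLLOW), any IP that lands here has ignored both signals. Blocking on the first hit is probably kinder to your own server than serving the loop forever.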