The only problem is that my site is based on dynamic subdirectories (or subdomains; I haven't decided yet). In other words, there is a list of directories that I plan to expand over time as the website grows. It would be very inefficient to list every one of them in robots.txt.
Basically, from the home/root page (www.example.com) there are links to
www.example.com/dir1/ (OR) dir1.example.com (still haven't decided which one to use)
www.example.com/dir2/ (OR) dir2.example.com
...and so on. Each directory has its own (standard) subdirectories.
1. How could I add this exception (or rule) to robots.txt without actually specifying the whole list? I'm thinking of something like
User-agent: googlebot
Allow: /
Disallow: /*
..repeat same thing for other search engines..
2. A greater concern for me is bots that ignore robots.txt. Would implementing a trap on the rest of the pages be enough? Any suggestions?
Re-jigger your site's directory structure so that all your sub-directory names (or, at least, all of 'em that you want to cover by this 'trick') are prefixed by a "_".
e.g., doc root:
/home/ftp/example.com/public_html/index.html
/home/ftp/example.com/public_html/_dir1/
/home/ftp/example.com/public_html/_dir2/
/home/ftp/example.com/public_html/_dir3/
/home/ftp/example.com/public_html/_etc/
/home/ftp/example.com/public_html/opendirectory/
User-agent: googlebot
Allow: /
Disallow: /_
Obviously you'll need to correct all your hrefs, as well.
Maybe a simple `sed` process...
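If sed feels too blunt, the same href rewrite can be sketched in a few lines of Python. This is only an illustration: the directory names in RENAMED and the helper prefix_hrefs are made up for the example, and it assumes the links are root-relative (href="/dir1/...").

```python
import re

# Hypothetical names of the directories being renamed to _dir1, _dir2, ...
RENAMED = ["dir1", "dir2", "dir3"]

def prefix_hrefs(html, names=RENAMED, prefix="_"):
    """Rewrite root-relative hrefs such as href="/dir1/..." to href="/_dir1/..."."""
    alternatives = "|".join(re.escape(n) for n in names)
    # The lookahead stops "dir1" from matching inside "dir10".
    pattern = re.compile(r'(href=["\']/)(' + alternatives + r')(?=[/"\'])')
    return pattern.sub(lambda m: m.group(1) + prefix + m.group(2), html)

print(prefix_hrefs('<a href="/dir1/page.html">one</a> <a href="/opendirectory/">open</a>'))
```

Links that already carry the prefix are left alone, so running it twice over the same file should do no harm.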
Once you get past the robots.txt exercises, it'll be time to visit .htaccess, which is several orders of magnitude more complicated than robots.txt. You'll need it because rogue robots will not abide by the rules in your robots.txt. (Search WebmasterWorld for "Yanga WorldSearch Bot".)
I'll think this strategy over and see how I can best implement it. Common search engines also support '*' for patterns, so I was hoping to use something like 'Disallow: /*', which is technically the same idea as yours, except more general. I'll have to research and see how effective it is compared to using a specific character (i.e. _).
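One way to compare the two rule sets before deploying anything is a small matcher. The sketch below assumes the "longest matching pattern wins, Allow wins ties" evaluation that Google documents and RFC 9309 describes; other crawlers may evaluate rules differently (some simply take the first match). The helper names are invented for the example, and the third rule set, using the '$' end-of-URL anchor, is an extra guess at what "home page only" might look like.

```python
import re

def _pattern_to_regex(pattern):
    """Translate a robots.txt path pattern ('*' wildcard, trailing '$' anchor) to a regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def is_allowed(rules, path):
    """rules: list of (directive, pattern) pairs, e.g. ("Disallow", "/_").

    The longest matching pattern wins; on a tie, Allow wins.
    A path that matches no rule is allowed.
    """
    best_len, allowed = -1, True
    for directive, pattern in rules:
        if not pattern:
            continue  # an empty pattern matches nothing
        if _pattern_to_regex(pattern).match(path):
            is_allow = directive.lower() == "allow"
            if len(pattern) > best_len or (len(pattern) == best_len and is_allow):
                best_len, allowed = len(pattern), is_allow
    return allowed

wildcard = [("Allow", "/"), ("Disallow", "/*")]    # the '/*' idea
prefixed = [("Allow", "/"), ("Disallow", "/_")]    # the '_' prefix idea
root_only = [("Allow", "/$"), ("Disallow", "/")]   # assumption: only the home page stays open

for name, rules in [("wildcard", wildcard), ("prefixed", prefixed), ("root_only", root_only)]:
    print(name, [is_allowed(rules, p) for p in ("/", "/_dir1/", "/opendirectory/")])
```

Under this reading, 'Disallow: /*' is longer than 'Allow: /' and matches every path, so it shuts out the home page too, while the '_' prefix only blocks the directories that carry it.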
As for bad bots, I know, I'll probably implement a 'trap', and get rid of any bots that come through.
On a different note, I'm thinking about the bot-trap I'm going to implement. I can't help being tempted to create an 'infinite loop' as the trap. Right now, all 'blocked' pages have meta tags (in addition to robots.txt):
<meta name="ROBOTS" content="NOINDEX, NOFOLLOW" />
so the way I see it, if any bots decide to ignore these, they deserve no better than to go for a ride. Basically, a single PHP file generates random URLs, all of which get re-mapped back to that same file. So the bot is welcome to click away...
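To make the idea concrete, here is a rough Python stand-in for that PHP file. It is a sketch only: the /trap/ path, the function trap_page and the handler class are invented for the example, and it assumes /trap/ is also disallowed in robots.txt so that well-behaved crawlers never end up in it.

```python
import secrets
from http.server import BaseHTTPRequestHandler

TRAP_PREFIX = "/trap/"  # hypothetical path; assumed to be disallowed in robots.txt

def trap_page(n_links=5):
    """Return an HTML page whose links all point back into the trap."""
    links = "\n".join(
        f'<a href="{TRAP_PREFIX}{secrets.token_hex(8)}.html">more</a>'
        for _ in range(n_links)
    )
    return (
        "<html><head>"
        '<meta name="ROBOTS" content="NOINDEX, NOFOLLOW" />'
        "</head><body>\n" + links + "\n</body></html>"
    )

class TrapHandler(BaseHTTPRequestHandler):
    """Every URL under TRAP_PREFIX is answered with a fresh page of random trap links."""

    def do_GET(self):
        if not self.path.startswith(TRAP_PREFIX):
            self.send_error(404)
            return
        body = trap_page().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```

To try it locally, something like HTTPServer(("127.0.0.1", 8000), TrapHandler).serve_forever() from http.server should do. Keep in mind that every page served this way costs your own server bandwidth and CPU as well.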
Obviously the other solution is to block the IP of anyone stumbling on this file/directory.
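The IP-blocking route could be sketched like this: collect the addresses that hit the trap and render them as .htaccess rules. The function names are made up, and the output assumes Apache 2.4 style 'Require' directives; older Apache versions use a different syntax, so treat the format as an assumption to check against your own server.

```python
import ipaddress

def record_hit(blocked, remote_addr):
    """Add a valid client IP to the blocked set; ignore malformed addresses."""
    try:
        blocked.add(str(ipaddress.ip_address(remote_addr)))
    except ValueError:
        pass
    return blocked

def htaccess_rules(blocked):
    """Render the blocked set as an Apache 2.4 style access block (assumed syntax)."""
    lines = ["<RequireAll>", "    Require all granted"]
    lines += [f"    Require not ip {ip}" for ip in sorted(blocked)]
    lines.append("</RequireAll>")
    return "\n".join(lines)

blocked = set()
record_hit(blocked, "203.0.113.7")   # documentation-range address, for illustration only
record_hit(blocked, "not-an-ip")     # silently ignored
print(htaccess_rules(blocked))
```

Validating the address before writing it out keeps a forged or garbled value from ending up in the config file.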
Any thoughts?