Blocking bots - Sitemaps, Meta Data, and robots.txt forum at WebmasterWorld - WebmasterWorld

Forum Moderators: goodroi

Blocking bots

ag_47

11:06 pm on Dec 23, 2008 (gmt 0)

10+ Year Member

After thinking this over, I've decided it would be best to completely block ALL bots from accessing my site. The only exception would be the home page (root), and if possible, I'd rather make this exception only for search engines. So basically I want all bots to come to the home page, index it if they need to, then just go away.

The only problem is that my site is based on dynamic subdirectories (or subdomains; I'm still deciding). In other words, there is a list of directories that I plan to expand over time as the website grows. It would be very inefficient to actually add all of these to robots.txt.

Basically, from the home/root page (www.example.com) there are links to

www.example.com/dir1/ (OR) dir1.example.com (still haven't decided which one to use)
www.example.com/dir2/ (OR) dir2.example.com

...and so on. Each directory has its own (standard) sub-directories.

1. How could I add this exception (or rule) in robots.txt without actually specifying the whole list? I'm thinking something like

User-agent: googlebot
Allow: /
Disallow: /*

..repeat same thing for other search engines..

2. A greater concern for me is bots that ignore robots.txt. Would implementing a trap on the rest of the pages be enough? Any suggestions?

Jonesy

4:18 am on Dec 24, 2008 (gmt 0)

10+ Year Member

Prefix every one of the sub-directory names with "_" (an underscore).

Disallow: /_

Worked for me to keep several message boards out of Google and the rest...
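For anyone who wants to sanity-check a prefix rule like this before deploying it, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The bot name `ExampleBot`, the `User-agent: *` line, and the sample paths are assumptions for illustration, not part of the original post. It only shows how a parser that does plain prefix matching reads the rule; real crawlers may behave differently.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt for illustration: one group, one prefix rule.
ROBOTS_TXT = """\
User-agent: *
Disallow: /_
"""


def build_parser(robots_txt):
    """Parse robots.txt text held in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Hypothetical paths: underscore-prefixed directories should be blocked,
# everything else should remain fetchable.
for path in ("/", "/_dir1/", "/_dir2/page.html", "/opendirectory/"):
    allowed = parser.can_fetch("ExampleBot", "http://www.example.com" + path)
    print(path, "allowed" if allowed else "blocked")
```

With these assumptions, the two underscore-prefixed paths come back blocked and the other two allowed, because `Disallow: /_` is matched as a simple path prefix.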

ag_47

7:17 am on Dec 26, 2008 (gmt 0)

10+ Year Member

See, that's what I'm trying to avoid: having to list all directories. I want to do it globally.
Basically, Allow the root directory, Disallow everything else.
Btw, Probably won't use subdomains, only sub-directories. Sub-domain will always be empty.

Jonesy

6:19 pm on Dec 27, 2008 (gmt 0)

10+ Year Member

Maybe I wasn't clear -- a skill I'm still working on.

Re-jigger your site's directory structure so that all your sub-directory names (or, at least all of'em that you want to cover by this 'trick') are prefixed by a "_".

e.g., doc root:


/home/ftp/example.com/public_html/index.html
/home/ftp/example.com/public_html/_dir1/
/home/ftp/example.com/public_html/_dir2/
/home/ftp/example.com/public_html/_dir3/
/home/ftp/example.com/public_html/_etc/
/home/ftp/example.com/public_html/opendirectory/

Then all you'll need in robots.txt would be:


User-agent: googlebot 
Allow: / 
Disallow: /_

The pattern "/_" matches all the carefully contrived directory names you've set up.

Obviously you'll need to correct all your hrefs to point at the renamed directories, as well.
Maybe a simple `sed` process...
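As a rough illustration of that href clean-up, here is a minimal sketch in Python rather than `sed`. The function name `prefix_dir_links` and the directory names are hypothetical, and the sketch assumes root-relative links of the form `href="/dir1/..."`. A regex like this will miss links written in other styles, so treat it as a starting point only.

```python
import re


def prefix_dir_links(html, dir_names):
    """Rewrite href="/dir1/..." to href="/_dir1/..." for the listed directories.

    Only root-relative hrefs whose first path segment is exactly one of
    dir_names are touched; all other links are left alone.
    """
    names = "|".join(re.escape(name) for name in dir_names)
    # Group 1: href= plus the opening quote and leading slash.
    # Group 2: the directory name, which must be followed by a slash.
    pattern = re.compile(r"""(href=["']/)(""" + names + r""")(?=/)""")
    return pattern.sub(r"\g<1>_\g<2>", html)


sample = '<a href="/dir1/page.html">one</a> <a href="/opendirectory/">two</a>'
print(prefix_dir_links(sample, ["dir1", "dir2"]))
```

The lookahead for a trailing slash keeps a directory such as `/dir10/` from being rewritten when only `dir1` is listed.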

Once you get past the robots.txt exercises, it'll be time to visit .htaccess -- several orders of magnitude more complicated than robots.txt -- because you'll find that the rogue robots will not abide by the rules in your robots.txt. (Search WebmasterWorld for "Yanga WorldSearch Bot".)

ag_47

9:51 pm on Dec 27, 2008 (gmt 0)

10+ Year Member

OK, I see. Thank you for the clarification. I assumed I'd have to add all _dirs into robots.txt.

I'll think this strategy over and see how I can best implement it. Common search engines also support '*' for patterns, so I was hoping to use something like 'Disallow: /*', which is technically the same idea as yours, except more general. I'll have to research and see how effective it is compared with using a specific character (i.e. _).

As for bad bots, I know, I'll probably implement a 'trap', and get rid of any bots that come through.

ag_47

4:58 am on Jan 8, 2009 (gmt 0)

10+ Year Member

Just an update: after testing with Google Webmaster Tools, the following version successfully blocks everything except the home page:
User-agent: *
Allow: /$
Disallow: /
---------------
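To see why this version works where `Allow: /` plus `Disallow: /*` would not, here is a small sketch of the precedence that Google documents for its crawler: `*` matches any run of characters, a trailing `$` anchors the end of the path, the longest matching pattern wins, and a tie goes to Allow. The function names are made up for illustration, and this is an approximation under those stated rules, not Google's actual parser. Other bots may evaluate the same file differently.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches any sequence of characters; a trailing '$' anchors the end.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def is_allowed(path, rules):
    """Apply longest-match precedence; rules are (directive, pattern) pairs."""
    best_len, best_allow = -1, True  # no matching rule means allowed
    for directive, pattern in rules:
        if pattern_to_regex(pattern).match(path):
            allow = directive == "allow"
            # Longer pattern wins; on equal length, prefer allow.
            if len(pattern) > best_len or (len(pattern) == best_len and allow):
                best_len, best_allow = len(pattern), allow
    return best_allow


WORKING = [("allow", "/$"), ("disallow", "/")]
FIRST_GUESS = [("allow", "/"), ("disallow", "/*")]

for path in ("/", "/dir1/", "/index.html"):
    print(path, is_allowed(path, WORKING), is_allowed(path, FIRST_GUESS))
```

Under these assumptions, the working rules allow only `/`, because `/$` is longer than `/` and matches nothing else. The first guess blocks every path, the home page included, since `/*` is the longer pattern and matches everything.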

On a different note, I'm thinking about the bot trap I'm going to implement. I can't help being tempted to create an 'infinite loop' as the trap. Right now, all 'blocked' pages have meta tags (in addition to robots.txt):
<meta name="ROBOTS" content="NOINDEX, NOFOLLOW" />
so the way I see it, if any bots decide to ignore these, then they deserve no better than to go for a ride. Basically, a single PHP file generates a random URL, and all of those URLs keep getting re-mapped to the same file. So the bot is welcome to click away...
Obviously the other solution is to block the IP of anyone stumbling on this file/directory.
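To make the two options concrete, here is a minimal sketch of the idea in Python (the post talks about PHP; this is only an illustration, not the actual implementation). The trap prefix `/_trap/`, the in-memory `blocked_ips` set, and the function name are all assumptions. A real setup would persist the blocked addresses and enforce the block at the server level.

```python
import secrets

TRAP_PREFIX = "/_trap/"  # hypothetical path, disallowed in robots.txt
blocked_ips = set()      # in-memory only, for illustration


def handle_trap_request(path, client_ip):
    """Return a trap page for paths under TRAP_PREFIX, else None.

    Any client that reaches the trap ignored robots.txt and the meta
    tags, so its IP is recorded. The page links to a fresh random URL
    under the same prefix, so a misbehaving bot can keep following
    links forever.
    """
    if not path.startswith(TRAP_PREFIX):
        return None
    blocked_ips.add(client_ip)
    next_url = TRAP_PREFIX + secrets.token_hex(8)
    return (
        '<html><head><meta name="ROBOTS" content="NOINDEX, NOFOLLOW" />'
        '</head><body><a href="%s">more</a></body></html>' % next_url
    )


page = handle_trap_request("/_trap/start", "203.0.113.7")
print(page)
print(sorted(blocked_ips))
```

Recording the IP on the first hit means the loop and the block are not mutually exclusive: the bot can be sent for a ride while its address is collected for later blocking.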

Any thoughts?