
    
Blocking bots
ag_47
11:06 pm on Dec 23, 2008 (gmt 0)

After thinking this over, I've decided it would be best to completely block ALL bots from accessing my site. The only exception would be the home page (root), and if possible, I'd rather add this exception only for search engines. So basically I want all bots to come to the home page, index it if they need to, then just go away.

The only problem is that my site is based on dynamic subdirectories (or subdomains, I'm still deciding). In other words, there is a list of directories that I plan to expand over time as the website grows. It would be very inefficient to actually add all of these to robots.txt.

Basically, from the home/root page (www.example.com) there are links to

www.example.com/dir1/ (OR) dir1.example.com (still haven't decided which one to use)
www.example.com/dir2/ (OR) dir2.example.com

...and so on. Each directory has its own (standard) sub-directories...

1. How could I add this exception (or rule) in robots.txt without actually specifying the whole list? I'm thinking something like

User-agent: googlebot
Allow: /
Disallow: /*

...repeat the same thing for the other search engines...

2. A greater concern for me is bots that ignore robots.txt. Would implementing a trap on the rest of the pages be enough? Any suggestions?

 

Jonesy
4:18 am on Dec 24, 2008 (gmt 0)

Prefix every one of the sub-directory names with "_" (an underscore).

Disallow: /_

Worked for me to keep several message boards out of Google and the rest...

ag_47
7:17 am on Dec 26, 2008 (gmt 0)

See, that's what I'm trying to avoid: having to list all the directories. I want to do it globally.
Basically: Allow the root directory, Disallow everything else.
Btw, I probably won't use subdomains, only sub-directories. The sub-domain will always be empty.

Jonesy
6:19 pm on Dec 27, 2008 (gmt 0)

Maybe I wasn't clear -- a skill I'm still working on.

Re-jigger your site's directory structure so that all your sub-directory names (or at least all of 'em that you want to cover with this 'trick') are prefixed with a "_".

e.g., doc root:

/home/ftp/example.com/public_html/index.html
/home/ftp/example.com/public_html/_dir1/
/home/ftp/example.com/public_html/_dir2/
/home/ftp/example.com/public_html/_dir3/
/home/ftp/example.com/public_html/_etc/
/home/ftp/example.com/public_html/opendirectory/

Then all you'll need in robots.txt would be:

User-agent: googlebot
Allow: /
Disallow: /_

The pattern "/_" matches all the carefully contrived directory names you've set up.

Obviously you'll need to correct all your href's, as well.
Maybe a simple `sed` process...
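Something along these lines would probably cover the bulk of it for static pages -- the paths and patterns below are only illustrative, and take a backup first:

# rough sketch: rewrite links like href="/dir1/" to href="/_dir1/"
# in every .html file under the doc root
find /home/ftp/example.com/public_html -name '*.html' \
  -exec sed -i 's|href="/dir|href="/_dir|g' {} +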

Once you get past the robots.txt exercises, it'll be time to visit .htaccess -- several orders of magnitude more complicated than robots.txt. Because, you'll find that the rogue robots will not abide by the rules in your robots.txt. (Search WebmasterWorld for "Yanga WorldSearch Bot".)
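For what it's worth, a typical starting point is a couple of mod_rewrite conditions on the User-Agent string. The names below are only examples -- match whatever actually shows up in your own logs:

# deny requests from a couple of example rogue user-agents
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} Yanga [NC,OR]
RewriteCond %{HTTP_USER_AGENT} WorldSearch [NC]
RewriteRule .* - [F,L]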

ag_47
9:51 pm on Dec 27, 2008 (gmt 0)

Ok, I see. Thank you for the clarification. I assumed I'd have to add all the _dirs into robots.txt.

I'll think this strategy over and see how I can best implement it. The common search engines also support '*' for patterns; I was hoping to use something like 'Disallow: /*', which is technically the same idea as yours, just more general. I'll have to research how effective that is compared to actually using a specific character (i.e. _).

As for bad bots, I know; I'll probably implement a 'trap' and get rid of any bots that come through.

ag_47
4:58 am on Jan 8, 2009 (gmt 0)

Just an update: after testing with Google Webmaster Tools, the following version successfully blocks everything except the home page (the '$' anchors the pattern to the end of the URL, so only the bare root is allowed):
User-agent: *
Allow: /$
Disallow: /
---------------

On a different note, I'm thinking about the bot-trap I'm going to implement. I can't help but be tempted to create an 'infinite loop' as the trap. Now, all 'blocked' pages have meta tags (in addition to robots.txt):
<meta name="ROBOTS" content="NOINDEX, NOFOLLOW" />
so the way I see it, if any decide to ignore these, then they deserve no better than to go for a ride... Basically, a single PHP file generates a random URL, and all of those URLs keep getting re-mapped back to the same file. So the bot is welcome to click away...
Obviously the other solution is to block the IP of anyone stumbling onto this file/directory...
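Something like this rough sketch is what I have in mind -- the file name, log path, and the .htaccess rule that would map /trap/* back to this script are all placeholders, and the real thing would need a whitelist so legitimate crawlers and users never land here:

<?php
// bot-trap.php -- rough sketch only
// Log the visitor's IP so it can be added to a block list later.
$ip = $_SERVER['REMOTE_ADDR'];
file_put_contents('/var/log/bot-trap.log', date('c') . ' ' . $ip . "\n", FILE_APPEND);

// Generate a random "next" URL; .htaccess (not shown) would rewrite
// /trap/* back to this same script, so a bot that ignores the rules
// just keeps walking in circles.
$next = '/trap/' . md5(uniqid(mt_rand(), true)) . '.html';
?>
<html>
<head><meta name="ROBOTS" content="NOINDEX, NOFOLLOW" /></head>
<body><a href="<?php echo $next; ?>">more</a></body>
</html>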

Any thoughts?
