Can't succeed in blocking directories in robots.txt

Forum Moderators: goodroi

The paths are already in robots.txt, but Google still crawls those pages.

stephang

9:41 am on Feb 16, 2009 (gmt 0)

Hello everybody!

I am trying to block some specific URLs on my website using robots.txt, but when I check Google Webmaster Tools I can see that those URLs are still being crawled.

I'm trying to block the "contact_us" directory, where the aaa and bbb path segments are variable:

http://www.example.com/aaa/bbb/contact_us/

I've added the following to my robots.txt, but it does not seem to be working:

User-agent: *
Disallow: /contact_us/

Or should I insert this one?

User-agent: *
Disallow: /*/*/contact_us/

Thank you all for your kind replies! :)

jdMorgan

12:45 pm on Feb 16, 2009 (gmt 0)

Your second snippet should work -- but only for search engines that explicitly state on their "webmaster help" page that they support wild-cards in robots.txt.

Wild-card matching is NOT part of the Standard for Robot Exclusion; it is a semi-proprietary extension. The standard implementation uses prefix-matching and does not support wild-card URL-paths, which is why "Disallow: /contact_us/" only matches URL-paths that begin with /contact_us/. For those search engines not supporting wild-card extensions, you will need to state the "aaa" and "bbb" URL-path-parts explicitly, or re-architect your URL structure so that those variables occur at the end of your URL-paths instead of at the beginning. This is something to consider for your next new site or for a re-design of an existing one.
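To make the difference concrete, here is a minimal Python sketch of the two matching styles. It is an illustration only, not how any particular search engine implements matching: the function names `prefix_blocked` and `wildcard_blocked` are made up for this example, and the wild-card rules assumed here ("*" matches any run of characters, a trailing "$" anchors the end of the URL-path) follow the commonly documented extension.

```python
import re


def prefix_blocked(rule: str, path: str) -> bool:
    """Standard behaviour: a Disallow rule blocks any URL-path that starts with it."""
    return path.startswith(rule)


def wildcard_blocked(rule: str, path: str) -> bool:
    """Extension behaviour (assumed): '*' matches any characters, trailing '$' anchors the end."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape the literal pieces and let each '*' match any run of characters.
    pattern = ".*".join(re.escape(part) for part in rule.split("*"))
    if anchored:
        pattern += "$"
    # re.match anchors at the start of the URL-path, like a prefix match.
    return re.match(pattern, path) is not None


path = "/aaa/bbb/contact_us/"
print(prefix_blocked("/contact_us/", path))        # False: path does not start with the rule
print(prefix_blocked("/*/*/contact_us/", path))    # False: '*' is just a literal character here
print(wildcard_blocked("/*/*/contact_us/", path))  # True
```

Under prefix matching, neither rule from the original question blocks /aaa/bbb/contact_us/; only a robot that interprets "*" as a wild-card treats the second rule as a match.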

For the search engines that *do* support wild-carding, you can try something like:


User-agent: googlebot
User-agent: slurp
User-agent: msnbot
Disallow: /*/*/contact_us/
#
User-agent: *
Disallow: /

This would tell the "big three" (Google, Yahoo, and MSN) not to fetch your /contact_us subdirectories, while telling all others not to fetch anything on your site.
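How a robot decides which group of rules applies to it can be sketched in Python as follows. This is a simplified illustration under stated assumptions: `parse_groups` and `rules_for` are hypothetical helper names, only User-agent and Disallow lines are handled, and a robot is assumed to use the first group whose User-agent token appears in its name, falling back to the "*" group otherwise. Real robots may resolve groups differently.

```python
ROBOTS_TXT = """\
User-agent: googlebot
User-agent: slurp
User-agent: msnbot
Disallow: /*/*/contact_us/
#
User-agent: *
Disallow: /
"""


def parse_groups(text):
    """Return a list of (agents, disallow_rules) pairs, one per record."""
    groups = []
    agents, rules = [], []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if rules:  # a User-agent line after rules starts a new record
                groups.append((agents, rules))
                agents, rules = [], []
            agents.append(value.lower())
        elif field == "disallow":
            rules.append(value)
    if agents:
        groups.append((agents, rules))
    return groups


def rules_for(bot_name, groups):
    """Pick the rules for a named robot, falling back to the '*' record."""
    bot = bot_name.lower()
    fallback = []
    for agents, rules in groups:
        if any(agent != "*" and agent in bot for agent in agents):
            return rules
        if "*" in agents:
            fallback = rules
    return fallback


groups = parse_groups(ROBOTS_TXT)
print(rules_for("Googlebot/2.1", groups))  # ['/*/*/contact_us/']
print(rules_for("SomeOtherBot", groups))   # ['/']
```

The three named robots only see the contact_us rule, while every other compliant robot falls through to the "*" record and is told to stay away from the whole site.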

Jim

stephang

4:09 am on Feb 17, 2009 (gmt 0)

Thank you jdMorgan.

But why would I block all other robots from fetching anything from my site?

tangor

4:13 am on Feb 17, 2009 (gmt 0)

All robots are not equal... and some are worse than others!

In reality you allow the bots that bring benefit to your site, i.e., visitors. All others need not apply.

choster

10:29 pm on Feb 17, 2009 (gmt 0)

Operators of bad bots are unlikely to conform to robots.txt directives. If a bot is badly behaved, you'll need to block it at the server level by other means.

stephang

6:27 am on Feb 18, 2009 (gmt 0)

I just came back to tell you that the following works perfectly.

User-agent: *
Disallow: /*/*/contact_us/

Thank you all! :)