
Sitemaps, Meta Data, and robots.txt Forum

    
Can't succeed in blocking directories in robots.txt
The paths are already in robots.txt but Google still crawls those pages.
stephang

5+ Year Member



 
Msg#: 3850500 posted 9:41 am on Feb 16, 2009 (gmt 0)

Hello everybody!

I am trying to block some specific URLs on my website using robots.txt, but when I checked in Google Webmaster Tools I can see that those URLs are still being crawled by robots.

So I'm trying to block the directory "contact_us", taking into consideration that "aaa" and "bbb" are variable.

http://www.example.com/aaa/bbb/contact_us/

I've inserted the following in my robots.txt, but it does not seem to be working.

User-agent: *
Disallow: /contact_us/

Or should I insert this one?

User-agent: *
Disallow: /*/*/contact_us/

Thank you all for your kind replies! :)

 

jdMorgan

WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member



 
Msg#: 3850500 posted 12:45 pm on Feb 16, 2009 (gmt 0)

Your second snippet should work -- but only for search engines that explicitly state on their "webmaster help" page that they support wild-cards in robots.txt.

This is NOT part of the Standard for Robot Exclusion, but is a semi-proprietary extension. The standard implementation uses prefix-matching and does not support wild-card URL-paths. For those search engines not supporting wild-card extensions, you will need to state the "aaa" and "bbb" URL-path-parts explicitly, or re-architect your URL structure so that those variables occur at the end of your URL-paths instead of at the beginning. This is something to consider for your next new site or existing site re-design.
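To see the prefix-matching behavior concretely, here is a sketch (my illustration, not from the thread) using Python's standard-library parser, which implements only the original prefix-matching standard and does not understand wild-cards:

```python
from urllib import robotparser

# Python's stdlib parser follows the original prefix-matching standard.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /contact_us/",
])

# Prefix matching: only URL-paths that START WITH /contact_us/ are blocked.
print(rp.can_fetch("*", "http://www.example.com/contact_us/"))          # False (blocked)
print(rp.can_fetch("*", "http://www.example.com/aaa/bbb/contact_us/"))  # True  (not blocked)

# A standard-only parser does not understand the wild-card form either:
rp2 = robotparser.RobotFileParser()
rp2.parse([
    "User-agent: *",
    "Disallow: /*/*/contact_us/",
])
print(rp2.can_fetch("*", "http://www.example.com/aaa/bbb/contact_us/"))  # True (not blocked)
```

This is exactly the problem in the original post: under prefix matching, "Disallow: /contact_us/" blocks only paths that begin with /contact_us/, not the same directory nested two levels deep.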

For the search engines that *do* support wild-carding, you can try something like:

User-agent: googlebot
User-agent: slurp
User-agent: msnbot
Disallow: /*/*/contact_us/
#
User-agent: *
Disallow: /

This would tell the "big three" not to fetch your /contact_us subdirectories, while telling all others not to fetch anything on your site.
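For engines that do support the extension, the matching works roughly like a start-anchored regular expression in which "*" matches any run of characters and a trailing "$" anchors the end of the URL-path. A rough sketch of that translation (my own illustration, not any engine's actual code):

```python
import re

def rule_to_regex(rule: str) -> "re.Pattern[str]":
    """Translate a wild-card Disallow rule into a start-anchored regex.

    '*' matches any sequence of characters; a trailing '$' anchors the
    end of the URL-path. Everything else is matched literally.
    """
    parts = []
    for ch in rule:
        if ch == "*":
            parts.append(".*")
        elif ch == "$":
            parts.append("$")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))

blocked = rule_to_regex("/*/*/contact_us/")
print(bool(blocked.match("/aaa/bbb/contact_us/")))  # True  -- blocked
print(bool(blocked.match("/contact_us/")))          # False -- needs two path levels first
```

Note that re.match anchors at the start of the path, just as the rule itself does, so "/contact_us/" at the top level is not caught by the two-level wild-card rule.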

Jim

stephang

5+ Year Member



 
Msg#: 3850500 posted 4:09 am on Feb 17, 2009 (gmt 0)

Thank you jdMorgan.

But why would I block all other robots from fetching anything from my site?

tangor

WebmasterWorld Senior Member, Top Contributor of All Time, 5+ Year Member, Top Contributor of the Month



 
Msg#: 3850500 posted 4:13 am on Feb 17, 2009 (gmt 0)

Not all robots are equal... and some are worse than others!

In reality, you allow the bots that bring benefit to your site, i.e. visitors. All others need not apply.

choster

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 3850500 posted 10:29 pm on Feb 17, 2009 (gmt 0)

Operators of bad bots are unlikely to conform to robots.txt directives. If a bot is badly behaved, you'll need to block it at the server level with other means.
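For instance, if the server runs Apache with mod_rewrite enabled, one common approach is to refuse requests by User-Agent. A sketch, where "BadBot" is a placeholder for the offending bot's User-Agent string:

```apache
# .htaccess sketch (requires mod_rewrite).
# "BadBot" is a placeholder -- substitute the actual User-Agent substring.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} BadBot [NC]
RewriteRule .* - [F,L]
```

Unlike robots.txt, which relies on the bot's cooperation, this returns a 403 Forbidden regardless of whether the bot reads or honors robots.txt.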

stephang

5+ Year Member



 
Msg#: 3850500 posted 6:27 am on Feb 18, 2009 (gmt 0)

I just came back to tell you that the following works perfectly.

User-agent: *
Disallow: /*/*/contact_us/

Thank you all! :)

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved