Forum Library, Charter, Moderators: goodroi

Sitemaps, Meta Data, and robots.txt Forum

Robots.txt question.

5+ Year Member

Msg#: 3979530 posted 12:35 pm on Aug 27, 2009 (gmt 0)

Hi all,

Say I have this page URL:

Which is a duplicate content version of:

Assuming meta tags, canonical tags and 301's are out of the question, if I wanted to block just this page in robots.txt, would this be the correct syntax:

# robots.txt for http://www.example.com/

User-agent: *
Disallow: /pages

Sitemap: http://www.example.com/sitemap.xml

There are other pages within the /pages/ directory that I DON'T want to block, so I want to make sure I get this right and block only that URL from being crawled, not all URLs within the /pages/ subfolder.

Many Thanks!

[edited by: engine at 3:15 pm (utc) on Aug. 27, 2009]
[edit reason] Please use example.com [/edit]



5+ Year Member

Msg#: 3979530 posted 1:13 pm on Sep 1, 2009 (gmt 0)

IMO, it will block only the page http://www.example.com/pages. :)

# For most search engines' bots...
User-agent: *
Disallow: /pages/subfolder_1/ # subfolder_1 contains the pages you want to block.

# Google supports some pattern matching. The following blocks the pages in sub-directories of the directory *pages*, but allows the pages that sit directly in the directory *pages*.
User-agent: *
Disallow: /pages/*/*

I haven't tested these. Google Webmaster Tools gives us a wonderful tool for testing robots.txt. Why not test it yourself?
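To see what a wildcard rule like the one above would and would not match, here is a minimal Python sketch. It assumes the wildcard semantics commonly described for Googlebot ("*" matches any run of characters, a trailing "$" anchors the end of the path, and everything else is a prefix match). The function name rule_matches is made up for illustration; it is not any crawler's actual code.

```python
import re

def rule_matches(pattern, path):
    """Return True if a robots.txt path pattern matches a URL path.

    Assumed semantics (not taken from any crawler's source):
    '*' matches any run of characters, a trailing '$' anchors the
    end of the path, and anything else is a plain prefix match.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with ".*" for each "*".
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    # re.match anchors at the start, which gives the prefix behaviour.
    return re.match(regex, path) is not None

for path in ["/pages/subfolder_1/page.html", "/pages/page.html", "/pages/"]:
    print(path, rule_matches("/pages/*/*", path))
```

Under these assumptions, /pages/*/* only matches paths with at least one more slash after /pages/, so pages sitting directly in /pages/ are left alone.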


5+ Year Member

Msg#: 3979530 posted 6:56 pm on Sep 4, 2009 (gmt 0)

I have a question for the group -

Does anyone see a problem with the following:
# Allow Google
User-agent: googlebot
Disallow: /example.html

# Allow Yahoo
User-agent: Slurp
Disallow: /example.html

# Allow MSN
User-agent: msnbot
Disallow: /example.html

# Restrict All Crawlers But The Ones Above
User-agent: *
Disallow: /

Any feedback is greatly appreciated!


5+ Year Member

Msg#: 3979530 posted 12:25 pm on Sep 5, 2009 (gmt 0)

I've just tested this in GWT and Googlebot is allowed so I would assume the other major SE bots are also allowed and that Disallow: / is keeping all other bots out. I don't see a problem with it, even with the specific page restrictions it appears to work fine.

Does anybody know of any other good tool to test robots.txt syntax other than in GWT?
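One option outside GWT is the urllib.robotparser module in Python's standard library. The sketch below feeds it the rules from the post above and asks which bots may fetch which paths. The helper name allowed and the agent name SomeOtherBot are made up for illustration, and this parser is only one implementation, so a real crawler may interpret edge cases differently.

```python
from urllib.robotparser import RobotFileParser

# The rules from the post above, one group per crawler.
ROBOTS_TXT = """\
User-agent: googlebot
Disallow: /example.html

User-agent: Slurp
Disallow: /example.html

User-agent: msnbot
Disallow: /example.html

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def allowed(agent, path):
    """Ask the parser whether `agent` may fetch `path` on example.com."""
    return parser.can_fetch(agent, "http://www.example.com" + path)

for agent in ["Googlebot", "Slurp", "msnbot", "SomeOtherBot"]:
    print(agent, allowed(agent, "/"), allowed(agent, "/example.html"))
```

With this parser, the three named bots may fetch everything except /example.html, and any other agent falls through to the "*" group and is blocked everywhere.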


WebmasterWorld Senior Member jdmorgan, a WebmasterWorld Top Contributor of All Time, 10+ Year Member

Msg#: 3979530 posted 1:43 pm on Sep 5, 2009 (gmt 0)

Responding to the initial post,

Disallow: /pages

will disallow the file called "/pages", the directory called "/pages/", and all URL-paths below that directory. Anything that starts with "/pages" will be disallowed.

Robots.txt uses prefix-matching, so any URL that matches the prefix that you put in the Disallow directive will be disallowed.
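The prefix-matching behaviour can be checked with a short Python sketch using the standard-library urllib.robotparser module. The helper name is_allowed, the agent name ExampleBot, and the sample paths are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The rule from the initial post.
ROBOTS_TXT = """\
User-agent: *
Disallow: /pages
"""

def is_allowed(path, agent="ExampleBot"):
    """Return True if `agent` may fetch `path` under ROBOTS_TXT."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, "http://www.example.com" + path)

# Every path that starts with "/pages" is blocked, including ones
# that are not inside the /pages/ directory at all.
for path in ["/pages", "/pages/", "/pages/other.html",
             "/pages-archive.html", "/about.html"]:
    print(path, is_allowed(path))
```

Note that /pages-archive.html is blocked too, because it also begins with the characters "/pages".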

While Googlebot and a few other search engines' robots support limited pattern-matching, and even an "Allow" directive in some cases, there is no 'universal' solution to this problem other than to fix the structure of your site, and to prevent the duplicate-content problems in the first place.
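For crawlers that do honour "Allow", one possible arrangement is to allow the /pages/ directory and disallow the bare /pages prefix. The Python sketch below checks that arrangement with urllib.robotparser, which applies the first matching rule in a group. The rule set, the helper name can_crawl, and the agent name ExampleBot are assumptions for illustration, and robots that ignore "Allow" would still block everything under /pages.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: Allow is listed first so that first-match
# parsers reach it before the broader Disallow.
RULES = """\
User-agent: *
Allow: /pages/
Disallow: /pages
"""

def can_crawl(path, agent="ExampleBot"):
    """Return True if `agent` may fetch `path` under RULES."""
    parser = RobotFileParser()
    parser.parse(RULES.splitlines())
    return parser.can_fetch(agent, "http://www.example.com" + path)

for path in ["/pages", "/pages/", "/pages/other.html", "/pages.html"]:
    print(path, can_crawl(path))
```

The side effect is that any other URL beginning with "/pages" but not "/pages/", such as /pages.html, is blocked as well.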

Also, I'm not sure why you say that "301 redirects are out of the question," but if this is a limitation imposed by your hosting, then it's time to get a new host.


Robert Charlton

WebmasterWorld Administrator robert_charlton, a WebmasterWorld Top Contributor of All Time, 10+ Year Member, Top Contributors Of The Month

Msg#: 3979530 posted 3:51 am on Sep 15, 2009 (gmt 0)

To block a specific page only... use the meta robots tag instead of robots.txt, on the page you don't want indexed, in the head section. In this case, the syntax would be:

<meta name="robots" content="noindex,follow">


5+ Year Member

Msg#: 3979530 posted 4:11 am on Sep 15, 2009 (gmt 0)

One easy way is to use the following code to block /pages/ only.

User-agent: *
Disallow: /pages/index.html

This assumes that index.html is the main page for the /pages/ directory.

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved