


Robots.txt question.



12:35 pm on Aug 27, 2009 (gmt 0)

5+ Year Member

Hi all,

Say I have this page URL:

Which is a duplicate content version of:

Assuming meta tags, canonical tags and 301's are out of the question, if I wanted to block just this page in robots.txt, would this be the correct syntax:

# robots.txt for http://www.example.com/

User-agent: *

Disallow: /pages

Sitemap: http://www.example.com/sitemap.xml

There are other pages within the /pages/ directory that I DON'T want to block, so I want to make sure I get this right and only block that one URL from being crawled, not all URLs within the /pages/ subfolder.

Many Thanks!

[edited by: engine at 3:15 pm (utc) on Aug. 27, 2009]
[edit reason] Please use example.com [/edit]
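The proposed rule can be sanity-checked offline with Python's standard-library `urllib.robotparser`, which implements the classic prefix-matching rules (the example.com URLs below are placeholders standing in for the edited-out URLs):

```python
from urllib import robotparser

# Parse the proposed rule directly, without fetching anything.
rp = robotparser.RobotFileParser()
rp.parse("""\
User-agent: *
Disallow: /pages
""".splitlines())

# Prefix matching means this blocks more than one page: every path
# that starts with "/pages" is disallowed.
print(rp.can_fetch("*", "http://www.example.com/pages"))              # False
print(rp.can_fetch("*", "http://www.example.com/pages/other.html"))   # False
print(rp.can_fetch("*", "http://www.example.com/about.html"))         # True
```

So as written, the rule would also block the other pages under /pages/ that should stay crawlable.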


1:13 pm on Sep 1, 2009 (gmt 0)

5+ Year Member

IMO, Disallow: /pages will block every URL beginning with /pages, not just that one page. :)

# For most search engines' bots...
User-agent: *
Disallow: /pages/subfolder_1/ # subfolder_1 contains the pages you want to block

# Google supports some pattern matching. The following blocks pages in
# subdirectories of /pages/ but allows pages that sit directly in /pages/:
User-agent: *
Disallow: /pages/*/*

I haven't tested these myself. Google Webmaster Tools gives us a wonderful tool for testing robots.txt, so why not verify it there?
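The original robots.txt standard has no wildcards; Google's extension treats `*` as any run of characters and a trailing `$` as an end-of-path anchor. A rough sketch of that matching logic (my own illustration, not Google's actual code) shows why `/pages/*/*` spares pages that sit directly in /pages/:

```python
import re

def google_match(pattern: str, path: str) -> bool:
    """Approximate Google-style robots.txt pattern matching:
    '*' matches any sequence of characters, a trailing '$' anchors
    the pattern to the end of the path."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal pieces and rejoin with ".*" for each "*".
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    if anchored:
        regex += "$"
    return re.match(regex, path) is not None

# "/pages/*/*" requires at least one more "/" after /pages/, so it
# matches pages in subdirectories but not pages directly in /pages/:
print(google_match("/pages/*/*", "/pages/sub/page.html"))  # True
print(google_match("/pages/*/*", "/pages/page.html"))      # False
```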


6:56 pm on Sep 4, 2009 (gmt 0)

5+ Year Member

I have a question for the group -

Does anyone see a problem with the following:
# Allow Google
User-agent: googlebot
Disallow: /example.html

# Allow Yahoo
User-agent: Slurp
Disallow: /example.html

# Allow MSN
User-agent: msnbot
Disallow: /example.html

# Restrict All Crawlers But The Ones Above
User-agent: *
Disallow: /

Any feedback is greatly appreciated!
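One way to check the group selection offline is Python's standard-library `urllib.robotparser` (a sketch; the paths are the ones from the post above). A named bot uses its own group and ignores the `*` group:

```python
from urllib import robotparser

rules = """\
# Allow Google
User-agent: googlebot
Disallow: /example.html

# Allow Yahoo
User-agent: Slurp
Disallow: /example.html

# Allow MSN
User-agent: msnbot
Disallow: /example.html

# Restrict All Crawlers But The Ones Above
User-agent: *
Disallow: /
"""
rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# The named bots match their own groups, so only /example.html is off-limits:
print(rp.can_fetch("googlebot", "/other.html"))    # True
print(rp.can_fetch("googlebot", "/example.html"))  # False
print(rp.can_fetch("Slurp", "/other.html"))        # True
# Everything else falls through to "User-agent: *" and is blocked:
print(rp.can_fetch("UnknownBot", "/other.html"))   # False
```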


12:25 pm on Sep 5, 2009 (gmt 0)

5+ Year Member

I've just tested this in GWT and Googlebot is allowed, so I would assume the other major SE bots are also allowed and that Disallow: / is keeping all other bots out. I don't see a problem with it; even with the specific page restrictions it appears to work fine.

Does anybody know of any other good tool to test robots.txt syntax other than in GWT?


1:43 pm on Sep 5, 2009 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member

Responding to the initial post,

 Disallow: /pages 

will disallow the file called "/pages", the directory called "/pages/", and all URL-paths below that directory. Anything that starts with "/pages" will be disallowed.

Robots.txt uses prefix-matching, so any URL that matches the prefix that you put in the Disallow directive will be disallowed.

While Googlebot and a few other search engines' robots support limited pattern-matching, and even an "Allow" directive in some cases, there is no 'universal' solution to this problem other than to fix the structure of your site, and to prevent the duplicate-content problems in the first place.

Also, I'm not sure why you say that "301 redirects are out of the question," but if this is a limitation imposed by your hosting, then it's time to get a new host.


Robert Charlton

3:51 am on Sep 15, 2009 (gmt 0)

WebmasterWorld Administrator robert_charlton is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

To block just a specific page... use the meta robots tag instead of robots.txt. Place it in the head section of the page you don't want indexed. In this case, the syntax would be:

<meta name="robots" content="noindex,follow">


4:11 am on Sep 15, 2009 (gmt 0)

5+ Year Member

One of the easy ways is to use the following code to block that one page only.

User-agent: *
Disallow: /pages/index.html

This assumes that index.html is the main page of the /pages/ directory.
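This narrower rule can also be checked with Python's `urllib.robotparser` (a sketch with placeholder paths). One caveat: prefix matching means /pages/ itself stays crawlable, and on many servers that URL serves the very same index.html, so the duplicate may still be reachable:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse("""\
User-agent: *
Disallow: /pages/index.html
""".splitlines())

# Only paths starting with /pages/index.html are blocked:
print(rp.can_fetch("*", "/pages/index.html"))  # False
print(rp.can_fetch("*", "/pages/other.html"))  # True
# The directory URL itself is still allowed, even if the server
# serves index.html for it:
print(rp.can_fetch("*", "/pages/"))            # True
```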

