
Forum Moderators: goodroi


Robots.txt question.

12:35 pm on Aug 27, 2009 (gmt 0)

New User

5+ Year Member

joined:Aug 24, 2009
posts: 7
votes: 0

Hi all,

Say I have this page URL:

Which is a duplicate content version of:

Assuming meta tags, canonical tags and 301's are out of the question, if I wanted to block just this page in robots.txt, would this be the correct syntax:

# robots.txt for http://www.example.com/

User-agent: *

Disallow: /pages

Sitemap: http://www.example.com/sitemap.xml

There are other pages within the /pages/ directory that I DON'T want to block, so I want to make sure I get this right and only block that one URL from being crawled, not every URL within the /pages/ subfolder.

Many Thanks!

[edited by: engine at 3:15 pm (utc) on Aug. 27, 2009]
[edit reason] Please use example.com [/edit]

1:13 pm on Sept 1, 2009 (gmt 0)

New User

5+ Year Member

joined:July 11, 2008
votes: 0

IMO, it will disallow the page http://www.example.com/pages only. :)

# For most search engines' bots...
User-agent: *
Disallow: /pages/subfolder_1/ # subfolder_1 contains the pages you want to block

# Google supports some pattern matching; the following blocks pages in any sub-directory of /pages/ but allows pages that sit directly in the /pages/ directory.
User-agent: *
Disallow: /pages/*/*

I haven't tested these. Google Webmaster Tools gives us a wonderful tool to test robots.txt. Why not test it yourself there?

6:56 pm on Sept 4, 2009 (gmt 0)

New User

5+ Year Member

joined:Aug 29, 2008
votes: 0

I have a question for the group -

Does anyone see a problem with the following:
# Allow Google
User-agent: googlebot
Disallow: /example.html

# Allow Yahoo
User-agent: Slurp
Disallow: /example.html

# Allow MSN
User-agent: msnbot
Disallow: /example.html

# Restrict All Crawlers But The Ones Above
User-agent: *
Disallow: /

Any feedback is greatly appreciated!
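For a quick local sanity check, Python's standard-library `urllib.robotparser` can parse a rules file like the one above and report what each agent may fetch (a sketch; the host and file names are placeholders):

```python
from urllib.robotparser import RobotFileParser

# The whitelist-style robots.txt from the post above, minus comments.
robots_txt = """\
User-agent: googlebot
Disallow: /example.html

User-agent: Slurp
Disallow: /example.html

User-agent: msnbot
Disallow: /example.html

User-agent: *
Disallow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Named bots are allowed everywhere except the one page...
print(rp.can_fetch("googlebot", "http://www.example.com/other.html"))    # True
print(rp.can_fetch("googlebot", "http://www.example.com/example.html"))  # False
# ...while everyone else falls through to the catch-all Disallow: /
print(rp.can_fetch("RandomBot", "http://www.example.com/anything.html"))  # False
```

Note that `urllib.robotparser` follows the original exclusion protocol (no wildcard support), so it is only a rough stand-in for how each engine's own crawler interprets the file.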

12:25 pm on Sept 5, 2009 (gmt 0)

New User

5+ Year Member

joined:Aug 24, 2009
votes: 0

I've just tested this in GWT and Googlebot is allowed, so I'd assume the other major SE bots are also allowed and that Disallow: / is keeping all other bots out. I don't see a problem with it; even with the specific page restrictions it appears to work fine.

Does anybody know of any other good tool to test robots.txt syntax other than in GWT?

1:43 pm on Sept 5, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Mar 31, 2002
votes: 0

Responding to the initial post,

 Disallow: /pages 

will disallow the file called "/pages", the directory called "/pages/", and all URL-paths below that directory. Anything that starts with "/pages" will be disallowed.

Robots.txt uses prefix-matching, so any URL that matches the prefix that you put in the Disallow directive will be disallowed.
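That prefix-matching behavior is easy to see with Python's standard-library parser (a sketch; the paths are made up for illustration):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse("""\
User-agent: *
Disallow: /pages
""".splitlines())

# Every URL-path that *starts with* /pages is disallowed --
# including /pageset.html, which is not in the /pages/ directory at all.
for path in ("/pages", "/pages/", "/pages/about.html", "/pageset.html", "/other.html"):
    print(path, rp.can_fetch("anybot", "http://www.example.com" + path))
# /pages False, /pages/ False, /pages/about.html False,
# /pageset.html False, /other.html True
```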

While Googlebot and a few other search engines' robots support limited pattern-matching, and even an "Allow" directive in some cases, there is no 'universal' solution to this problem other than to fix the structure of your site, and to prevent the duplicate-content problems in the first place.

Also, I'm not sure why you say that "301 redirects are out of the question," but if this is a limitation imposed by your hosting, then it's time to get a new host.


3:51 am on Sept 15, 2009 (gmt 0)

Moderator from US 

WebmasterWorld Administrator robert_charlton is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2000
votes: 311

To block a specific page only... use the meta robots tag instead of robots.txt, on the page you don't want indexed, in the head section. In this case, the syntax would be:

<meta name="robots" content="noindex,follow">

4:11 am on Sept 15, 2009 (gmt 0)

Preferred Member

10+ Year Member

joined:Mar 29, 2007
votes: 0

One easy way is to use the following code to block the main page of /pages/ only:

User-agent: *
Disallow: /pages/index.html

This assumes that index.html is the main page of the /pages/ directory.
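A quick check with Python's `urllib.robotparser` (file names here are assumptions) shows that this blocks only the one file and leaves the rest of the directory crawlable:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse("""\
User-agent: *
Disallow: /pages/index.html
""".splitlines())

print(rp.can_fetch("anybot", "http://www.example.com/pages/index.html"))  # False
print(rp.can_fetch("anybot", "http://www.example.com/pages/other.html"))  # True
print(rp.can_fetch("anybot", "http://www.example.com/pages/"))            # True
```

One caveat: on most servers the bare directory URL /pages/ serves the same content as /pages/index.html, and as the last line shows, /pages/ itself is still crawlable, so the duplicate may remain reachable at that address.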

