
Forum Moderators: goodroi


Robots.txt question.

   
12:35 pm on Aug 27, 2009 (gmt 0)

5+ Year Member



Hi all,

Say I have this page URL:
http://www.example.com/pages/

Which is a duplicate content version of:
http://www.example.com/

Assuming meta tags, canonical tags, and 301s are out of the question, if I wanted to block just this page in robots.txt, would this be the correct syntax:

# robots.txt for http://www.example.com/

User-agent: *

Disallow: /pages

Sitemap: http://www.example.com/sitemap.xml

There are other pages within the /pages/ directory that I DON'T want to block, so I want to make sure I get this right and only block that URL from being crawled, not all URLs within the /pages/ subfolder.

Many Thanks!
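For reference, a rule like the one above can be sanity-checked locally with Python's standard-library robotparser (a sketch; the example.com URLs are the ones from the question):

```python
# Sketch: check what "Disallow: /pages" actually blocks, using Python's
# standard-library parser. Note it uses prefix matching on the URL path.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /pages",
])

# Every path starting with "/pages" is blocked, not just one page.
print(rp.can_fetch("*", "http://www.example.com/pages/"))            # False
print(rp.can_fetch("*", "http://www.example.com/pages/other.html"))  # False
print(rp.can_fetch("*", "http://www.example.com/"))                  # True
```

As the output shows, this rule would also block the other pages under /pages/ that the poster wants to keep crawlable.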

[edited by: engine at 3:15 pm (utc) on Aug. 27, 2009]
[edit reason] Please use example.com [/edit]

1:13 pm on Sep 1, 2009 (gmt 0)

5+ Year Member



IMO, it will disallow the page http://www.example.com/pages only. :)

# For most search engines' bots...
User-agent: *
Disallow: /pages/subfolder_1/ # subfolder_1 contains the pages you want to block.


# Google supports some pattern matching. The following blocks pages in
# subdirectories of the *pages* directory, but allows pages that sit
# directly in the pages directory.
User-agent: *
Disallow: /pages/*/*

I haven't tested these myself. Google Webmaster Tools gives us a wonderful tool to test robots.txt, so why not test it yourself? :)
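One caveat worth knowing when testing: Python's standard-library robotparser ignores wildcards, so a rule like Disallow: /pages/*/* has to be checked elsewhere. Here is a tiny sketch of how Google-style wildcard matching works (the matches function is illustrative, not any real library's API):

```python
# Sketch of Google-style robots.txt matching: "*" matches any run of
# characters, "$" anchors the end of the path, and otherwise the rule
# is a prefix match against the URL path.
import re

def matches(pattern: str, path: str) -> bool:
    """Return True if a Disallow pattern blocks the given URL path."""
    regex = ""
    for ch in pattern:
        if ch == "*":
            regex += ".*"
        elif ch == "$":
            regex += "$"
        else:
            regex += re.escape(ch)
    # re.match anchors at the start, which gives us prefix matching.
    return re.match(regex, path) is not None

# "/pages/*/*" blocks subdirectory contents, but not files directly in /pages/.
print(matches("/pages/*/*", "/pages/sub/a.html"))  # True  (blocked)
print(matches("/pages/*/*", "/pages/a.html"))      # False (allowed)
print(matches("/pages/*/*", "/pages/"))            # False (allowed)
```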

6:56 pm on Sep 4, 2009 (gmt 0)

5+ Year Member



I have a question for the group -

Does anyone see a problem with the following:
# Allow Google
User-agent: googlebot
Disallow: /example.html

# Allow Yahoo
User-agent: Slurp
Disallow: /example.html

# Allow MSN
User-agent: msnbot
Disallow: /example.html

# Restrict All Crawlers But The Ones Above
User-agent: *
Disallow: /

Any feedback is greatly appreciated!
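To sanity-check which group each bot falls under, the file above can be fed to Python's standard-library robotparser (a sketch; real crawlers may pick groups slightly differently, so GWT is still the authoritative test for Googlebot):

```python
# Sketch: verify that the named bots get only the /example.html restriction
# while every other bot falls through to the "Disallow: /" catch-all group.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse("""\
User-agent: googlebot
Disallow: /example.html

User-agent: Slurp
Disallow: /example.html

User-agent: msnbot
Disallow: /example.html

User-agent: *
Disallow: /
""".splitlines())

print(rp.can_fetch("googlebot", "http://www.example.com/"))              # True
print(rp.can_fetch("googlebot", "http://www.example.com/example.html"))  # False
print(rp.can_fetch("SomeOtherBot", "http://www.example.com/"))           # False
```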

12:25 pm on Sep 5, 2009 (gmt 0)

5+ Year Member



I've just tested this in GWT and Googlebot is allowed, so I would assume the other major SE bots are also allowed and that Disallow: / is keeping all other bots out. I don't see a problem with it; even with the specific page restrictions, it appears to work fine.

Does anybody know of any other good tool to test robots.txt syntax other than in GWT?

1:43 pm on Sep 5, 2009 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Responding to the initial post,

 Disallow: /pages 

will disallow the file called "/pages", the directory called "/pages/", and all URL-paths below that directory. Anything that starts with "/pages" will be disallowed.

Robots.txt uses prefix-matching, so any URL that matches the prefix that you put in the Disallow directive will be disallowed.
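Prefix matching here is essentially a startswith test on the URL path, which a quick sketch makes concrete (the paths are made up for illustration):

```python
# Sketch: a Disallow directive blocks every URL path that begins with
# the directive's value, character for character.
rule = "/pages"
for path in ["/pages", "/pages/", "/pages/deep/file.html", "/pagestuff", "/other"]:
    blocked = path.startswith(rule)
    print(f"{path}: {'blocked' if blocked else 'allowed'}")
```

Note that even /pagestuff is blocked, since it too starts with the prefix "/pages".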

While Googlebot and a few other search engines' robots support limited pattern-matching, and even an "Allow" directive in some cases, there is no 'universal' solution to this problem other than to fix the structure of your site, and to prevent the duplicate-content problems in the first place.

Also, I'm not sure why you say that "301 redirects are out of the question," but if this is a limitation imposed by your hosting, then it's time to get a new host.

Jim

3:51 am on Sep 15, 2009 (gmt 0)

WebmasterWorld Administrator robert_charlton is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



To block a specific page only, use the meta robots tag instead of robots.txt. Place it in the head section of the page you don't want indexed. In this case, the syntax would be:

<meta name="robots" content="noindex,follow">
4:11 am on Sep 15, 2009 (gmt 0)

5+ Year Member



An easy way is to use the following code to block /pages/ only.

User-agent: *
Disallow: /pages/index.html

This assumes that index.html is the main page for the /pages/ directory.
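One caveat with this approach, checked here with Python's standard-library robotparser (a sketch): robots.txt rules match URL paths literally, not the files they resolve to, so the directory URL /pages/ itself stays crawlable even when /pages/index.html is blocked.

```python
# Sketch: blocking the index file's explicit URL does not block the
# directory URL that serves the same content.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /pages/index.html",
])

# The explicit file URL is blocked...
print(rp.can_fetch("*", "http://www.example.com/pages/index.html"))  # False
# ...but the directory URL matches no rule and remains crawlable.
print(rp.can_fetch("*", "http://www.example.com/pages/"))            # True
```

So this only helps if crawlers actually discover the /pages/index.html form of the URL.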