Robots.txt question.

Forum Moderators: goodroi

seomonster

12:35 pm on Aug 27, 2009 (gmt 0)

Hi all,

Say I have this page URL:
http://www.example.com/pages/

Which is a duplicate content version of:
http://www.example.com/

Assuming meta tags, canonical tags and 301s are out of the question, if I wanted to block just this page in robots.txt, would this be the correct syntax?

# robots.txt for http://www.example.com/

User-agent: *

Disallow: /pages

Sitemap: http://www.example.com/sitemap.xml

There are other pages within the /pages/ directory that I DON'T want to block, so I want to make sure I get this right and only block that one URL from being crawled, not all URLs within the /pages/ subfolder.

Many Thanks!

[edited by: engine at 3:15 pm (utc) on Aug. 27, 2009]
[edit reason] Please use example.com [/edit]

Blan

1:13 pm on Sep 1, 2009 (gmt 0)

IMO, it will disallow only the page http://www.example.com/pages. :)


# For most search engines' bots...
User-agent: *
Disallow: /pages/subfolder_1/ # subfolder_1 contains the pages you want to block.


# Google supports some pattern matching. The following blocks pages in subdirectories of /pages/, but allows pages that sit directly in /pages/.
User-agent: *
Disallow: /pages/*/*

I haven't tested these. Google Webmaster Tools gives us a wonderful tool for testing robots.txt. Why not test it yourself?
:)
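
For anyone who wants to try a wildcard rule like the one above before uploading it, here is a minimal Python sketch of the pattern style Google documents, where "*" matches any run of characters and a trailing "$" marks the end of the URL. The function name google_style_match is made up for this illustration; it only mimics the documented matching behaviour and is not Google's implementation.

```python
import re

def google_style_match(pattern: str, path: str) -> bool:
    """Return True if `path` matches a robots.txt pattern that uses
    '*' (any run of characters) and an optional trailing '$' (end of URL).
    Hypothetical helper for illustration only."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with '.*' for every '*'.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    # re.match anchors at the start, so an unanchored pattern acts as a prefix.
    return re.match(regex, path) is not None

for path in ["/pages/", "/pages/about.html", "/pages/subfolder_1/a.html"]:
    print(path, google_style_match("/pages/*/*", path))
```

Under these assumptions, only the URL inside a subdirectory should match "/pages/*/*". The same helper suggests that a rule such as "/pages/$" would match the bare /pages/ URL and nothing below it, but only for robots that support the "$" extension.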

SwipeTheMagnets

6:56 pm on Sep 4, 2009 (gmt 0)

I have a question for the group -

Does anyone see a problem with the following:
# Allow Google
User-agent: googlebot
Disallow: /example.html

# Allow Yahoo
User-agent: Slurp
Disallow: /example.html

# Allow MSN
User-agent: msnbot
Disallow: /example.html

# Restrict All Crawlers But The Ones Above
User-agent: *
Disallow: /

Any feedback is greatly appreciated!

seomonster

12:25 pm on Sep 5, 2009 (gmt 0)

I've just tested this in GWT and Googlebot is allowed, so I would assume the other major SE bots are also allowed and that Disallow: / is keeping all other bots out. I don't see a problem with it; even with the specific page restrictions, it appears to work fine.

Does anybody know of a good tool for testing robots.txt syntax other than GWT?
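
One option that needs no account is Python's standard-library urllib.robotparser. The sketch below feeds it the file from the previous post and asks which user-agents may fetch which paths. Two caveats: this parser follows the original prefix-matching rules (no wildcard support), so it approximates rather than reproduces any particular search engine, and "SomeOtherBot" is just a made-up name standing in for an unlisted crawler.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
# Allow Google
User-agent: googlebot
Disallow: /example.html

# Allow Yahoo
User-agent: Slurp
Disallow: /example.html

# Allow MSN
User-agent: msnbot
Disallow: /example.html

# Restrict All Crawlers But The Ones Above
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def allowed(agent: str, path: str) -> bool:
    """True if `agent` may fetch `path` on the example host."""
    return parser.can_fetch(agent, "http://www.example.com" + path)

# Print, per agent, whether the home page and /example.html may be fetched.
for agent in ("Googlebot", "Slurp", "msnbot", "SomeOtherBot"):
    print(agent, allowed(agent, "/"), allowed(agent, "/example.html"))
```

If the file behaves as intended, the three named bots should be allowed on "/" but not on "/example.html", and the unlisted bot should be refused everywhere.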

jdMorgan

1:43 pm on Sep 5, 2009 (gmt 0)

Responding to the initial post,

 Disallow: /pages

will disallow the file called "/pages", the directory called "/pages/", and all URL-paths below that directory. Anything that starts with "/pages" will be disallowed.

Robots.txt uses prefix-matching, so any URL that matches the prefix that you put in the Disallow directive will be disallowed.
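
The prefix rule can be seen in a minimal sketch using Python's standard-library urllib.robotparser, which implements this basic prefix matching. The test paths are invented for illustration, and real crawlers may differ in details.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse(["User-agent: *", "Disallow: /pages"])

def blocked(path: str) -> bool:
    """True if a generic robot may NOT fetch `path` on the example host."""
    return not parser.can_fetch("*", "http://www.example.com" + path)

# Every path that begins with "/pages" is caught, not just the directory URL.
for path in ["/", "/about.html", "/pages", "/pages/",
             "/pages/other.html", "/pages-old.html"]:
    print(path, "blocked" if blocked(path) else "allowed")
```

Note that a hypothetical file such as /pages-old.html is blocked as well, since it also starts with "/pages".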

While Googlebot and a few other search engines' robots support limited pattern-matching, and even an "Allow" directive in some cases, there is no 'universal' solution to this problem other than to fix the structure of your site, and to prevent the duplicate-content problems in the first place.

Also, I'm not sure why you say that "301 redirects are out of the question," but if this is a limitation imposed by your hosting, then it's time to get a new host.

Jim

Robert Charlton

3:51 am on Sep 15, 2009 (gmt 0)

To block a specific page only, use the meta robots tag instead of robots.txt, placed in the head section of the page you don't want indexed. In this case, the syntax would be:

<meta name="robots" content="noindex,follow">

AnkitMaheshwari

4:11 am on Sep 15, 2009 (gmt 0)

One easy way is to use the following code to block /pages/ only.

User-agent: *
Disallow: /pages/index.html

This assumes that index.html is the main page for the /pages/ directory.
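
One thing worth checking with this approach: because robots.txt matches by prefix, the rule covers URLs that start with /pages/index.html, and the bare /pages/ URL does not start with that string. A quick sketch with Python's standard-library urllib.robotparser (which follows the basic prefix rules, so treat it as an approximation of real crawlers) shows the difference:

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse(["User-agent: *", "Disallow: /pages/index.html"])

# Compare the directory URL, the index file, and a sibling page.
for path in ["/pages/", "/pages/index.html", "/pages/other.html"]:
    ok = parser.can_fetch("*", "http://www.example.com" + path)
    print(path, "allowed" if ok else "blocked")
```

So the rule only helps if the duplicate content is actually reached as /pages/index.html rather than as /pages/.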