Robots.txt question.
Say I have a page URL that is a duplicate-content version of another URL on the site.
Assuming meta tags, canonical tags and 301s are out of the question, if I wanted to block just this page in robots.txt, would this be the correct syntax:
# robots.txt for http://www.example.com/
User-agent: *
Disallow: /pages
There are other pages within the /pages/ directory that I DON'T want to block, so I want to make sure I get this right and only block that URL from being crawled, not all URLs within the /pages/ subfolder.
IMO, it will block the page http://www.example.com/pages only. :)
# For most search engines' bots...
User-agent: *
Disallow: /pages/subfolder_1/ # subfolder_1 contains the pages you want to block
# Google supports some pattern matching. The following blocks pages in subdirectories of the /pages/ directory, but still allows pages that sit directly in /pages/.
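Something along these lines (untested, and the exact wildcard pattern is my assumption about the rule being described):

User-agent: Googlebot
# blocks /pages/subfolder_1/..., /pages/anything-else/..., etc.,
# but /pages/page.html (directly in /pages/) stays crawlable
Disallow: /pages/*/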
I haven't tested these, but Google Webmaster Tools gives us a wonderful tool for testing robots.txt, so why not check it there yourself?
I have a question for the group -
Does anyone see a problem with the following:
# Allow Google
# Allow Yahoo
# Allow MSN
# Restrict All Crawlers But The Ones Above
Any feedback is greatly appreciated!
I've just tested this in GWT and Googlebot is allowed, so I would assume the other major SE bots are also allowed and that Disallow: / is keeping all other bots out. I don't see a problem with it; even with the specific page restrictions it appears to work fine.
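For reference, the whitelist pattern being described generally looks something like this (a sketch of the shape, not the poster's exact file; Slurp and msnbot are the standard Yahoo and MSN user-agent tokens, and an empty Disallow value means "allow everything"):

# Allow Google
User-agent: Googlebot
Disallow:

# Allow Yahoo
User-agent: Slurp
Disallow:

# Allow MSN
User-agent: msnbot
Disallow:

# Restrict All Crawlers But The Ones Above
User-agent: *
Disallow: /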
Does anybody know of a good tool for testing robots.txt syntax other than GWT?
Responding to the initial post:
Disallow: /pages
will disallow the file called "/pages", the directory called "/pages/", and all URL-paths below that directory. Anything that starts with "/pages" will be disallowed.
Robots.txt uses prefix matching, so any URL whose path begins with the string you put in the Disallow directive will be disallowed.
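To illustrate the prefix matching (the example paths are made up):

User-agent: *
Disallow: /pages
# Blocked: /pages
#          /pages/
#          /pages/anything.html
#          /pages-old.html (it also starts with "/pages")
# Allowed: /page.html
#          /other/pages/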
While Googlebot and a few other search engines' robots support limited pattern matching, and in some cases even an "Allow" directive, there is no universal solution to this problem other than to fix the structure of your site and prevent the duplicate-content problems in the first place.
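That said, for crawlers that do support those extensions, a single URL can be blocked precisely. A sketch for Googlebot (the filename is made up for illustration; the trailing $ anchors the match to the end of the URL and is a Google extension, not part of the original robots.txt standard):

User-agent: Googlebot
# blocks exactly /pages/duplicate-page.html and nothing else in /pages/
Disallow: /pages/duplicate-page.html$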
Also, I'm not sure why you say that "301 redirects are out of the question," but if this is a limitation imposed by your hosting, then it's time to get a new host.
To block only a specific page, use the meta robots tag instead of robots.txt: put it in the head section of the page you don't want indexed. In this case, the syntax would be:
<meta name="robots" content="noindex,follow">
One of the easy ways is to use the following to block /pages/ only, assuming that index.html is the default page for the /pages/ directory.
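Something like this (untested; it assumes the server serves index.html for requests to /pages/, and note that a crawler requesting the bare /pages/ URL itself would not match this rule):

User-agent: *
Disallow: /pages/index.html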