| 1:13 pm on Sep 1, 2009 (gmt 0)|
IMO, it will block only the page http://www.example.com/pages. :)
# For most search engines' bots...
Disallow: /pages/subfolder_1/ # subfolder_1 contains the pages you want to block.
# Google supports some pattern matching. The following blocks the pages in the sub-directories of the directory *pages*, but allows other pages that sit directly in the directory pages.
I haven't tested these. Google Webmaster Tools gives us a wonderful tool for testing robots.txt. Why not test it yourself?
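For anyone who wants to see how Google-style pattern matching behaves, here is a rough Python sketch. It assumes the two documented wildcards only: `*` matches any run of characters and a trailing `$` anchors the end of the URL. The function name `google_style_match` and the pattern `/pages/*/` are made up for illustration; they are not the rules from the post above, and this is not Googlebot's actual implementation.

```python
import re


def google_style_match(pattern: str, url_path: str) -> bool:
    """Approximate Google-style robots.txt matching (illustrative only).

    '*' matches any run of characters, a trailing '$' anchors the end
    of the URL path, and everything else is a plain prefix match.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' for each '*'.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += r"\Z"
    # re.match anchors at the start, which gives prefix semantics.
    return re.match(regex, url_path) is not None


# Hypothetical pattern: block anything inside a sub-directory of /pages/.
pattern = "/pages/*/"
for path in ("/pages/subfolder_1/page.html", "/pages/page.html", "/pages/"):
    print(path, google_style_match(pattern, path))
```

With this hypothetical pattern, a URL inside a sub-directory matches, while a page sitting directly in /pages/ does not, which is the behaviour described above.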
| 6:56 pm on Sep 4, 2009 (gmt 0)|
I have a question for the group -
Does anyone see a problem with the following:
# Allow Google
# Allow Yahoo
# Allow MSN
# Restrict All Crawlers But The Ones Above
Any feedback is greatly appreciated!
| 12:25 pm on Sep 5, 2009 (gmt 0)|
I've just tested this in GWT and Googlebot is allowed, so I would assume the other major SE bots are also allowed and that Disallow: / is keeping all other bots out. I don't see a problem with it; even with the specific page restrictions, it appears to work fine.
Does anybody know of any other good tool to test robots.txt syntax other than in GWT?
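One possible way to check a setup like this locally is Python's standard-library `urllib.robotparser`. The sketch below is only an approximation: the robots.txt text is a hypothetical reconstruction (the original rules are not shown in this thread), the user-agent tokens `Googlebot`, `Slurp` and `msnbot` are assumed to be the ones intended, and Python's parser will not necessarily behave exactly like each search engine's own crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: allow three named bots, restrict everyone else.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: Slurp
Disallow:

User-agent: msnbot
Disallow:

User-agent: *
Disallow: /
"""


def allowed(user_agent: str, url: str) -> bool:
    """Return True if the parsed rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


# "SomeOtherBot" is a made-up crawler name standing in for all other bots.
for bot in ("Googlebot", "Slurp", "msnbot", "SomeOtherBot"):
    print(bot, allowed(bot, "http://www.example.com/pages/index.html"))
```

An empty `Disallow:` line means nothing is blocked for that bot, so the three named bots fall through to "allowed" while every other bot hits `Disallow: /`.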
| 1:43 pm on Sep 5, 2009 (gmt 0)|
Responding to the initial post,
will disallow the file called "/pages", the directory called "/pages/", and all URL-paths below that directory. Anything that starts with "/pages" will be disallowed.
Robots.txt uses prefix-matching, so any URL that matches the prefix that you put in the Disallow directive will be disallowed.
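To make the prefix-matching point concrete, here is a minimal Python sketch. The function name `is_disallowed` and the sample paths are made up for illustration; it models only plain prefix matching, with no wildcards and no "Allow" handling.

```python
def is_disallowed(url_path: str, disallow_prefixes: list) -> bool:
    """Return True if url_path starts with any non-empty Disallow prefix."""
    return any(
        bool(prefix) and url_path.startswith(prefix)
        for prefix in disallow_prefixes
    )


# A single rule equivalent to "Disallow: /pages".
rules = ["/pages"]
samples = (
    "/pages",
    "/pages/",
    "/pages/subfolder_1/a.html",
    "/pages-old.html",
    "/other/pages",
)
for path in samples:
    print(path, is_disallowed(path, rules))
```

Note that "/pages-old.html" is caught as well, because it also starts with "/pages", while "/other/pages" is not, because the match is made only at the start of the URL-path.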
While Googlebot and a few other search engines' robots support limited pattern-matching, and even an "Allow" directive in some cases, there is no 'universal' solution to this problem other than to fix the structure of your site, and to prevent the duplicate-content problems in the first place.
Also, I'm not sure why you say that "301 redirects are out of the question," but if this is a limitation imposed by your hosting, then it's time to get a new host.
| 3:51 am on Sep 15, 2009 (gmt 0)|
To block a specific page only... use the meta robots tag instead of robots.txt, on the page you don't want indexed, in the head section. In this case, the syntax would be:
<meta name="robots" content="noindex,follow">
| 4:11 am on Sep 15, 2009 (gmt 0)|
One easy way is to use the following code to block /pages/ only.
This assumes that index.html is the main page for the /pages/ directory.