Say I have this page URL:
Which is a duplicate content version of:
Assuming meta tags, canonical tags and 301s are out of the question, if I wanted to block just this page in robots.txt, would this be the correct syntax:
# robots.txt for http://www.example.com/
There are other pages within the /pages/ directory that I DON'T want to block, so I want to make sure I get this right and only block that URL from being crawled, not all URLs within the /pages/ subfolder.
[edited by: engine at 3:15 pm (utc) on Aug. 27, 2009]
[edit reason] Please use example.com [/edit]
# For most search engines' bots...
Disallow: /pages/subfolder_1/ # subfolder_1 contains the pages you want to block.
# Google supports some pattern matching. The following blocks the pages in subdirectories of the *pages* directory, but allows pages that sit directly in the pages directory.
I haven't tested these. Google Webmaster Tools gives us a wonderful tool for testing robots.txt. Why not test it yourself?
Does anyone see a problem with the following:
# Allow Google
# Allow Yahoo
# Allow MSN
# Restrict All Crawlers But The Ones Above
Any feedback is greatly appreciated!
Does anybody know of any other good tool to test robots.txt syntax other than in GWT?
will disallow the file called "/pages", the directory called "/pages/", and all URL-paths below that directory. Anything that starts with "/pages" will be disallowed.
Robots.txt uses prefix-matching, so any URL that matches the prefix that you put in the Disallow directive will be disallowed.
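To make the prefix rule concrete, here is a rough Python sketch of how a basic robot might compare a URL-path against its Disallow lines. The function name and the sample paths are made up for illustration, and real crawlers add their own refinements on top of this.

```python
def is_disallowed(url_path, disallow_rules):
    """Return True if url_path starts with any non-empty Disallow prefix.

    An empty Disallow value blocks nothing, so empty rules are skipped.
    """
    return any(rule and url_path.startswith(rule) for rule in disallow_rules)


# Hypothetical rule set: a single "Disallow: /pages" line.
rules = ["/pages"]

for path in [
    "/pages",                        # True: exact match
    "/pages/",                       # True: the directory itself
    "/pages/subfolder_1/page.html",  # True: anything below the directory
    "/pages-old.html",               # True: it also starts with "/pages"
    "/other/pages",                  # False: the prefix must match from the start
]:
    print(path, is_disallowed(path, rules))
```

Note the fourth case: because the match is a plain string prefix, a rule without a trailing slash also catches unrelated files whose names happen to begin the same way.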
While Googlebot and a few other search engines' robots support limited pattern-matching, and even an "Allow" directive in some cases, there is no 'universal' solution to this problem other than to fix the structure of your site, and to prevent the duplicate-content problems in the first place.
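If you want to see how one particular parser treats an "Allow" line, Python's standard-library urllib.robotparser can be fed a rule set directly. This is only a sketch: the file names below are hypothetical, and this parser simply applies the first rule whose prefix matches, which is not necessarily what Googlebot or any other robot does with the same file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: allow one page, block the rest of /pages/.
robots_txt = """\
User-agent: *
Allow: /pages/keep.html
Disallow: /pages/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# This parser checks rules in file order and stops at the first prefix match.
print(parser.can_fetch("*", "/pages/keep.html"))   # True: the Allow line matches first
print(parser.can_fetch("*", "/pages/other.html"))  # False: falls through to Disallow
print(parser.can_fetch("*", "/about.html"))        # True: no rule matches
```

A robot that ignores "Allow" altogether would treat both /pages/ URLs as blocked, which is exactly why this approach can't be relied on across all search engines.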
Also, I'm not sure why you say that "301 redirects are out of the question," but if this is a limitation imposed by your hosting, then it's time to get a new host.