Forum Moderators: goodroi
Say I have this page URL:
http://www.example.com/pages/
Which is a duplicate content version of:
http://www.example.com/
Assuming meta tags, canonical tags, and 301s are out of the question, if I wanted to block just this page in robots.txt, would this be the correct syntax:
# robots.txt for http://www.example.com/
User-agent: *
Disallow: /pages
Sitemap: http://www.example.com/sitemap.xml
There are other pages within the /pages/ directory that I DON'T want to block, so I want to make sure I get this right and only block that one URL from being crawled, not all URLs within the /pages/ subfolder.
Many Thanks!
[edited by: engine at 3:15 pm (utc) on Aug. 27, 2009]
[edit reason] Please use example.com [/edit]
# For most search engines' bots...
User-agent: *
Disallow: /pages/subfolder_1/ # subfolder_1 contains the pages you want to block.
# Google supports some pattern matching. The following blocks pages in sub-directories of the /pages/ directory, but still allows pages that sit directly in /pages/.
User-agent: *
Disallow: /pages/*/*
I haven't tested these myself, but Google Webmaster Tools gives you a tool to test your robots.txt, so why not check them there yourself?
:)
Does anyone see a problem with the following:
# Allow Google
User-agent: googlebot
Disallow: /example.html
# Allow Yahoo
User-agent: Slurp
Disallow: /example.html
# Allow MSN
User-agent: msnbot
Disallow: /example.html
# Restrict All Crawlers But The Ones Above
User-agent: *
Disallow: /
Any feedback is greatly appreciated!
Does anybody know of any other good tool for testing robots.txt syntax, besides the one in GWT?
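One option, if you have Python handy: the standard library's urllib.robotparser will parse a robots.txt (fetched from a live site via set_url()/read(), or fed in directly as lines) and answer can-this-agent-fetch-this-URL questions. It only implements the classic prefix rules, not Google's wildcard extensions, so treat it as a quick local sanity check rather than a substitute for GWT. For example, run against the allow-these-bots file from the post above:
from urllib.robotparser import RobotFileParser
# Feed the rules in directly; rp.set_url("http://www.example.com/robots.txt")
# followed by rp.read() would fetch a live file instead.
rp = RobotFileParser()
rp.parse([
    "User-agent: googlebot",
    "Disallow: /example.html",
    "",
    "User-agent: *",
    "Disallow: /",
])
print(rp.can_fetch("googlebot", "http://www.example.com/page.html"))     # True
print(rp.can_fetch("googlebot", "http://www.example.com/example.html"))  # False
print(rp.can_fetch("SomeOtherBot", "http://www.example.com/page.html"))  # False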
Disallow: /pages will disallow the file called "/pages", the directory called "/pages/", and all URL-paths below that directory. Anything that starts with "/pages" will be disallowed.
Robots.txt uses prefix-matching, so any URL that matches the prefix that you put in the Disallow directive will be disallowed.
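To see the prefix-matching concretely, here is the same stdlib parser fed the rule proposed in the first post; every URL beginning with /pages comes back blocked:
from urllib.robotparser import RobotFileParser
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /pages",
])
# All of these start with the "/pages" prefix, so all are disallowed:
print(rp.can_fetch("*", "http://www.example.com/pages"))         # False
print(rp.can_fetch("*", "http://www.example.com/pages/"))        # False
print(rp.can_fetch("*", "http://www.example.com/pages/a.html"))  # False
# No prefix match, so this one is still crawlable:
print(rp.can_fetch("*", "http://www.example.com/page.html"))     # True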
While Googlebot and a few other search engines' robots support limited pattern-matching, and even an "Allow" directive in some cases, there is no 'universal' solution to this problem other than to fix the structure of your site, and to prevent the duplicate-content problems in the first place.
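For instance, an Allow line can carve an exception out of a broader Disallow for the engines that honor it. A sketch, again with the stdlib parser (the /pages/keep/ path is made up for illustration, and note that this parser knows nothing of Google's * and $ wildcards):
from urllib.robotparser import RobotFileParser
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Allow: /pages/keep/",   # the exception, listed first
    "Disallow: /pages/",     # everything else under /pages/ is blocked
])
print(rp.can_fetch("*", "http://www.example.com/pages/keep/a.html"))  # True
print(rp.can_fetch("*", "http://www.example.com/pages/other.html"))   # False
# For Googlebot specifically, "Disallow: /pages/$" (with "$" anchoring the
# end of the URL) should block just the bare /pages/ index page while
# leaving everything below it crawlable; but that is a Google extension,
# so verify it in Google's own robots.txt tester before relying on it.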
Also, I'm not sure why you say that "301 redirects are out of the question," but if this is a limitation imposed by your hosting, then it's time to get a new host.
Jim