Can I use:
Disallow: /dir1/
...in my robots.txt file, then direct engines to the files in mysite.com/dir1/dir2/ that I want crawled, via my sitemap? When some engines (MSN in particular) crawl my site, they spawn shopping pages by crawling the CGI scripts in /dir1/, then see the actual "real" HTML pages in /dir1/dir2/ and consider them to be dupes.
your other option is to use wildcards aka pattern matching. Google and Yahoo handle it pretty well and MSN is working on it.
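For anyone unsure what the wildcard matching does, here is a rough Python sketch of how a '*' / '$' pattern is commonly interpreted. It is an assumption-laden illustration, not any engine's actual code: the function names and the example pattern /dir1/*.cgi are made up for this thread, and each engine may differ in the details.

```python
import re


def robots_pattern_to_regex(pattern):
    """Translate a robots.txt wildcard pattern into a compiled regex.

    Assumed semantics: '*' matches any run of characters, and a
    trailing '$' anchors the pattern to the end of the URL path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' for each '*'.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def matches(pattern, path):
    # re.match anchors at the start, mirroring the "begins with" rule.
    return robots_pattern_to_regex(pattern).match(path) is not None


cgi_hit = matches("/dir1/*.cgi", "/dir1/shop.cgi")
html_hit = matches("/dir1/*.cgi", "/dir1/dir2/page.html")
query_hit = matches("/dir1/*.cgi", "/dir1/shop.cgi?item=1")
query_hit_anchored = matches("/dir1/*.cgi$", "/dir1/shop.cgi?item=1")
```

Under those assumptions, a rule like Disallow: /dir1/*.cgi would catch the scripts but leave the HTML pages in /dir1/dir2/ alone.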
I was hoping to come up with a way to block the bots from /dir1/, but point them to /dir1/dir2/ via sitemap (where they're already going, anyway). Since the bots are already going to /dir1/dir2/, I just wanted to block them from /dir1/.
Would using:
Disallow: /dir1/
...in the robots.txt file prohibit Google from going to files in /dir1/dir2/, if I have links in my sitemap pointing specifically to them?
Dave
Thanks... I think what I was actually trying to ask was: If I use
Disallow: /dir1/
Will the engines go to files in /dir1/dir2/ if I point them there via my sitemap?
Thanks for the help!
Dave
If I use
Disallow: /dir1/
Will the engines go to files in /dir1/dir2/ if I point them there via my sitemap?
No. This will disallow any URL-path that *begins with* /dir1/
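The prefix rule can be checked locally with Python's standard-library robots.txt parser. This is only a sketch: the bot name "ExampleBot" and the file names shop.cgi and page.html are invented for illustration, and a real crawler may not behave exactly like this parser.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, agent, path):
    """Parse robots.txt text and report whether `agent` may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


RULES = """\
User-agent: *
Disallow: /dir1/
"""

# Everything whose path begins with /dir1/ is blocked, subdirectories included.
cgi_allowed = is_allowed(RULES, "ExampleBot", "/dir1/shop.cgi")
subdir_allowed = is_allowed(RULES, "ExampleBot", "/dir1/dir2/page.html")
# Paths outside /dir1/ are unaffected.
home_allowed = is_allowed(RULES, "ExampleBot", "/index.html")
```

Listing /dir1/dir2/ URLs in a sitemap does not change the result, since the sitemap is not consulted when the Disallow rule is evaluated.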
The code below should only be used with Yahoo and Google, and with other robots that you have verified accept the non-standard "Allow" directive. "Allow" is an unofficial extension to the Robots Standard, and is not supported by most robots.
User-agent: Googlebot
Disallow: /dir1/
Allow: /dir1/dir2/

User-agent: Slurp
Disallow: /dir1/
Allow: /dir1/dir2/

Or, combined into a single record:

User-agent: Googlebot
User-agent: Slurp
Disallow: /dir1/
Allow: /dir1/dir2/
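Before relying on an Allow override, it is worth testing the record against a parser, because implementations disagree on rule order. The sketch below uses Python's standard-library parser, which applies the first matching rule in a record; Google documents that the most specific (longest) matching rule wins instead. The file names and the "msnbot" check are illustrative assumptions only.

```python
from urllib.robotparser import RobotFileParser


def can_fetch(robots_txt, agent, path):
    """Parse robots.txt text and report whether `agent` may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


DISALLOW_FIRST = """\
User-agent: Googlebot
User-agent: Slurp
Disallow: /dir1/
Allow: /dir1/dir2/
"""

ALLOW_FIRST = """\
User-agent: Googlebot
User-agent: Slurp
Allow: /dir1/dir2/
Disallow: /dir1/
"""

page = "/dir1/dir2/page.html"

# A first-match parser stops at "Disallow: /dir1/" and never sees the Allow.
disallow_first_result = can_fetch(DISALLOW_FIRST, "Googlebot", page)
# With Allow listed first, the same parser lets the page through.
allow_first_result = can_fetch(ALLOW_FIRST, "Googlebot", page)
# The CGI scripts directly under /dir1/ stay blocked either way.
cgi_result = can_fetch(ALLOW_FIRST, "Slurp", "/dir1/shop.cgi")
# Robots not named in the record are not restricted by it at all.
unlisted_result = can_fetch(ALLOW_FIRST, "msnbot", "/dir1/shop.cgi")
```

Putting the Allow line first gives the same outcome under both first-match and longest-match interpretations, which may make it the safer ordering when targeting more than one robot.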
Jim
each robot (even among the big search engines) behaves differently. some can handle complex instructions and others fall to pieces when they encounter an unusual line in robots.txt. i strongly recommend that you monitor the crawling of your site anytime you touch the robots.txt. i know of many sites that made casual changes to robots.txt and, as a result, were no longer crawled or dropped out of the search engine index until their robots.txt was fixed.
be careful, keep it simple and good luck.