goodroi

msg:3239397 | 1:21 pm on Feb 1, 2007 (gmt 0) |
this is a hard choice for me. in general i find it is best to treat the search engine spiders like babies and try to keep it as simple as possible. ideally you would move your cgi out of /dir1/ and into another subdirectory that only contains files to be blocked. this keeps it very simple for the spiders to understand and less likely for them to screw up. your other option is to use wildcards aka pattern matching. Google and Yahoo handle it pretty well and MSN is working on it.
|
Vulcan315

msg:3239425 | 2:13 pm on Feb 1, 2007 (gmt 0) |
Unfortunately, the shopping software I have has to stay 1 dir above the html it uses to spawn the files. I've tried all sorts of rewrites over the last 8 years, and nothing works without breaking the SSL. I was hoping to come up with a way to block the bots from /dir1/, but point them to /dir1/dir2/ via sitemap (where they're already going, anyway). Since the bots are already going to /dir1/dir2/, I just wanted to block them from /dir1/. Would using: Disallow: /dir1/ ...in the robots.txt file prohibit Google from going to files in /dir1/dir2/, if I have links in my sitemap pointing specifically to them? Dave
|
goodroi

msg:3240643 | 2:02 pm on Feb 2, 2007 (gmt 0) |
User-agent: *
Disallow: /dir1/
Allow: /dir1/dir2/

that should work, but keep a watchful eye on it. search engine bots are made for quantity and not quality. in other words, the more finesse you try to apply, the more likely something will go wrong. good luck
|
Vulcan315

msg:3244120 | 2:11 pm on Feb 6, 2007 (gmt 0) |
"User-agent: *
Disallow: /dir1/
Allow: /dir1/dir2/
that should work, but keep a watchful eye on it. search engine bots are made for quantity and not quality. in other words, the more finesse you try to apply, the more likely something will go wrong. good luck"

Thanks... I think what I was actually trying to ask was: if I use

Disallow: /dir1/

will the engines go to files in /dir1/dir2/ if I point them there via my sitemap? Thanks for the help! Dave
|
jdMorgan

msg:3244153 | 3:08 pm on Feb 6, 2007 (gmt 0) |
"If I use Disallow: /dir1/ will the engines go to files in /dir1/dir2/ if I point them there via my sitemap?"

No. This will disallow any URL-path that *begins with* /dir1/

The code posted above should only be used with Yahoo and Google, and others which you have verified to accept the non-standard "Allow" directive. "Allow" is an unofficial extension to the Robots Standard, and is not supported by most robots.

User-agent: Googlebot
Disallow: /dir1/
Allow: /dir1/dir2/

User-agent: Slurp
Disallow: /dir1/
Allow: /dir1/dir2/

is the safest way to code it, although you might also wish to test

User-agent: Googlebot
User-agent: Slurp
Disallow: /dir1/
Allow: /dir1/dir2/

The problem is that some robots may not understand the multiple-user-agent record, and may become confused and either fetch/index nothing or everything.

Jim
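
To make the prefix-matching point concrete, here is a minimal Python sketch of how the same record can give different answers depending on whether a robot understands "Allow". The function name `is_allowed` and the `honors_allow` flag are made up for this illustration. The "longest matching prefix wins" rule is an assumption modelled on how Google describes resolving Allow/Disallow conflicts; it is not how any particular crawler is actually implemented.

```python
from typing import List, Tuple

# (directive, path prefix), e.g. ("Disallow", "/dir1/")
Rule = Tuple[str, str]


def is_allowed(path: str, rules: List[Rule], honors_allow: bool = True) -> bool:
    """Return True if `path` may be fetched under plain prefix matching.

    Assumption: when several rules match, the longest prefix wins, and
    Allow wins a tie. A robot that does not support Allow is modelled by
    honors_allow=False, which simply skips those lines.
    """
    best_len = -1
    allowed = True  # no matching rule at all means the path is allowed
    for directive, prefix in rules:
        is_allow = directive.lower() == "allow"
        if is_allow and not honors_allow:
            continue  # this robot ignores the non-standard directive
        if not prefix or not path.startswith(prefix):
            continue  # empty or non-matching prefix: rule does not apply
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


rules = [("Disallow", "/dir1/"), ("Allow", "/dir1/dir2/")]

# A robot that understands Allow can reach the subdirectory...
print(is_allowed("/dir1/dir2/page.html", rules))                      # True
# ...but one that ignores Allow is blocked by the /dir1/ prefix.
print(is_allowed("/dir1/dir2/page.html", rules, honors_allow=False))  # False
# With Disallow alone, everything under /dir1/ is blocked.
print(is_allowed("/dir1/dir2/page.html", [("Disallow", "/dir1/")]))   # False
```

The second call is the case to worry about: a robot without Allow support sees only the Disallow line, so the per-robot records above limit the Allow to robots known to handle it.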
|
goodroi

msg:3244212 | 3:52 pm on Feb 6, 2007 (gmt 0) |
Yeah, what Jim said :) each robot (even among the big search engines) behaves differently. some can handle complex instructions and others fall to pieces when they encounter an unusual line in robots.txt. i strongly recommend that you monitor the crawling of your site any time you touch the robots.txt. i know of many sites that made casual changes to robots.txt and ended up no longer being crawled, or dropped out of the search engine index, until their robots.txt was fixed. be careful, keep it simple and good luck.
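
One cheap sanity check before touching a live robots.txt is to run the pages you want crawled through a parser and see which ones the new rules would block. The sketch below uses Python's standard `urllib.robotparser`. The helper name `blocked_urls` and the example.com URLs are placeholders invented for the illustration. Python's parser is only one interpretation of the rules, so a clean result here does not guarantee that any given search engine robot will behave the same way, and it is no substitute for watching the real crawl activity.

```python
from urllib.robotparser import RobotFileParser


def blocked_urls(robots_txt: str, user_agent: str, urls):
    """Return the URLs that this robots.txt text blocks for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [u for u in urls if not parser.can_fetch(user_agent, u)]


# The Disallow-only record asked about earlier in the thread.
ROBOTS_TXT = """\
User-agent: *
Disallow: /dir1/
"""

# Hypothetical pages listed in the sitemap.
SITEMAP_URLS = [
    "http://www.example.com/index.html",
    "http://www.example.com/dir1/dir2/widget.html",
]

# Prints the sitemap URLs the draft rules would block.
print(blocked_urls(ROBOTS_TXT, "Googlebot", SITEMAP_URLS))
```

Here the /dir1/dir2/ page comes back as blocked even though the sitemap lists it, which matches Jim's answer above: a sitemap entry does not override a Disallow prefix.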
|
|