
Forum Moderators: goodroi


Blocking subdir, but allowing crawl in deeper dir?

   
1:59 pm on Jan 31, 2007 (gmt 0)

10+ Year Member



I have a scenario where I need to block engines from mysite.com/dir1/, but want them to crawl pages in mysite.com/dir1/dir2/ ... /dir1/ has a cgi script that spawns shopping pages, using the html in /dir2/.

Can I use:

Disallow: /dir1/

...in my robots.txt file, and then direct engines to the files in mysite.com/dir1/dir2/ that I want crawled via my sitemap? When some engines (MSN in particular) crawl my site, they spawn shopping pages by crawling the cgi in /dir1/, then see the actual "real" html pages in /dir2/ and consider them to be dupes.

1:21 pm on Feb 1, 2007 (gmt 0)

WebmasterWorld Administrator goodroi is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



this is a hard choice for me. in general i find it is best to treat the search engine spiders like babies and keep things as simple as possible. ideally you would move your cgi out of /dir1/ and into another subdirectory that only contains files to be blocked. this keeps it very simple for the spiders to understand and makes them less likely to screw up.

your other option is to use wildcards, aka pattern matching. Google and Yahoo handle them pretty well, and MSN is working on it.
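To see what that wildcard option would do, here is a toy sketch of the pattern matching the major engines extended robots.txt with (this is an illustration, not any engine's actual implementation; the function name and paths are made up for the example):

```python
import re

def google_style_match(pattern, path):
    # Toy illustration of engine-extended robots.txt wildcards:
    # '*' matches any run of characters, a trailing '$' anchors the
    # end of the URL, and plain patterns match as prefixes.
    regex = ""
    for ch in pattern:
        if ch == "*":
            regex += ".*"
        elif ch == "$":
            regex += "$"
        else:
            regex += re.escape(ch)
    return re.match(regex, path) is not None

# A rule like "Disallow: /dir1/*.cgi" would catch the script...
print(google_style_match("/dir1/*.cgi", "/dir1/shop.cgi"))        # True
# ...while leaving the html pages in /dir1/dir2/ crawlable
print(google_style_match("/dir1/*.cgi", "/dir1/dir2/page.html"))  # False
```

Under this scheme a single `Disallow: /dir1/*.cgi` rule could block just the cgi without touching /dir1/dir2/, for the engines that support it.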

2:13 pm on Feb 1, 2007 (gmt 0)

10+ Year Member



Unfortunately, the shopping software I have has to stay one directory above the html it uses to spawn the files. I've tried all sorts of rewrites over the last 8 years, and nothing works without breaking the SSL.

I was hoping to come up with a way to block the bots from /dir1/, but point them to /dir1/dir2/ via sitemap (where they're already going, anyway). Since the bots are already going to /dir1/dir2/, I just wanted to block them from /dir1/.

Would using:

Disallow: /dir1/

...in the robots.txt file prevent Google from crawling files in /dir1/dir2/, if I have links in my sitemap pointing specifically to them?

Dave

2:02 pm on Feb 2, 2007 (gmt 0)

WebmasterWorld Administrator goodroi is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



User-agent: *
Disallow: /dir1/
Allow: /dir1/dir2/

that should work but keep a watchful eye on it. search engine bots are made for quantity and not quality. in other words, the more finesse you attempt, the more likely something will go wrong. good luck
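One way to keep that watchful eye is to sanity-check the rules locally before deploying them. A minimal sketch using Python's urllib.robotparser (the domain and filenames are made up; note one caveat: Python's parser applies the *first* matching rule, so the Allow line is listed first here, whereas Google applies the *longest* match and reads either order the same way):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for mysite.com. Allow comes before Disallow
# because urllib.robotparser stops at the first matching rule.
rules = """\
User-agent: *
Allow: /dir1/dir2/
Disallow: /dir1/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# The cgi in /dir1/ is blocked...
print(rp.can_fetch("*", "http://mysite.com/dir1/shop.cgi"))        # False
# ...but the real html pages in /dir1/dir2/ stay crawlable
print(rp.can_fetch("*", "http://mysite.com/dir1/dir2/page.html"))  # True
```

Running a check like this against the paths you care about is a cheap way to catch a robots.txt mistake before the engines do.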

2:11 pm on Feb 6, 2007 (gmt 0)

10+ Year Member



"User-agent: *
Disallow: /dir1/
Allow: /dir1/dir2/

that should work but keep a watchful eye on it. search engine bots are made for quantity and not quality. in other words, the more finesse you attempt, the more likely something will go wrong. good luck"

Thanks... I think what I was actually trying to ask was: If I use

Disallow: /dir1/

Will the engines go to files in /dir1/dir2/ if I point them there via my sitemap?

Thanks for the help!
Dave

3:08 pm on Feb 6, 2007 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member



If I use
Disallow: /dir1/

Will the engines go to files in /dir1/dir2/ if I point them there via my sitemap?

No. This will disallow any URL-path that *begins with* /dir1/.

The code posted above should only be used with Yahoo and Google, and others which you have verified to accept the non-standard "Allow" directive. "Allow" is an unofficial extension to the Robots Standard, and is not supported by most robots.

User-agent: Googlebot
Disallow: /dir1/
Allow: /dir1/dir2/

User-agent: Slurp
Disallow: /dir1/
Allow: /dir1/dir2/


is the safest way to code it, although you might also wish to test

User-agent: Googlebot
User-agent: Slurp
Disallow: /dir1/
Allow: /dir1/dir2/


The problem is that some robots may not understand the multiple-user-agent record, and may become confused and either fetch/index nothing or everything.

Jim
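Those per-robot records can be checked locally the same way; a sketch with Python's urllib.robotparser (the robot names are from the thread, the URLs are invented, and Allow is placed first because that parser is first-match rather than longest-match like Google):

```python
from urllib.robotparser import RobotFileParser

# Separate records per robot, as recommended above. A bot that
# matches no named record falls through to the '*' record, which
# simply blocks /dir1/ -- so robots that don't understand the
# non-standard Allow directive fail safe.
rules = """\
User-agent: Googlebot
Allow: /dir1/dir2/
Disallow: /dir1/

User-agent: *
Disallow: /dir1/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# Googlebot gets the Allow exception for /dir1/dir2/ ...
print(rp.can_fetch("Googlebot", "http://mysite.com/dir1/dir2/page.html"))   # True
# ...while an unnamed robot is kept out of /dir1/ entirely
print(rp.can_fetch("UnknownBot", "http://mysite.com/dir1/dir2/page.html"))  # False
```

That fallback behavior is exactly why naming only the robots you have verified is the safer layout: an unknown bot loses access to /dir1/dir2/, but it can never misread the Allow line and fetch the cgi.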

3:52 pm on Feb 6, 2007 (gmt 0)

WebmasterWorld Administrator goodroi is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



Yea what Jim said :)

each robot (even among the big search engines) behaves differently. some can handle complex instructions and others fall to pieces when they encounter an unusual line in robots.txt. i strongly recommend that you monitor the crawling of your site anytime you touch the robots.txt. i know of many sites that made casual changes to robots.txt and, as a result, were no longer crawled or included in the search engine index until their robots.txt was fixed.

be careful, keep it simple and good luck.