
Sitemaps, Meta Data, and robots.txt Forum

    
Blocking subdir, but allowing crawl in deeper dir?
Vulcan315 (10+ Year Member)
Msg#: 3238252 posted 1:59 pm on Jan 31, 2007 (gmt 0)

I have a scenario where I need to block engines from mysite.com/dir1/, but want them to crawl pages in mysite.com/dir1/dir2/. /dir1/ has a cgi script that spawns shopping pages, using the html in /dir1/dir2/.

Can I use:

Disallow: /dir1/

...in my robots.txt file, then direct engines to the files in mysite.com/dir1/dir2/ that I want crawled, via my sitemap? When some engines (MSN in particular) crawl my site, they spawn shopping pages by crawling the cgi in /dir1/, then see the actual "real" html pages in /dir1/dir2/ and consider them to be dupes.
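For what it's worth, the sitemap half of that plan would just list the deeper pages explicitly; a minimal sketch, with hypothetical file names:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://mysite.com/dir1/dir2/widgets.html</loc></url>
  <url><loc>http://mysite.com/dir1/dir2/gadgets.html</loc></url>
</urlset>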

 

goodroi (WebmasterWorld Administrator, Top Contributor of All Time, 10+ Year Member, Top Contributor of the Month)
Msg#: 3238252 posted 1:21 pm on Feb 1, 2007 (gmt 0)

this is a hard choice for me. in general i find it is best to treat search engine spiders like babies and keep things as simple as possible. ideally you would move your cgi out of /dir1/ and into another subdirectory that contains only files to be blocked. that keeps things very simple for the spiders to understand and makes them less likely to screw up.

your other option is to use wildcards, aka pattern matching. Google and Yahoo handle it pretty well, and MSN is working on it.
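for example, if the shopping script lived at /dir1/shop.cgi (a hypothetical name), a pattern-matched block for the engines that support wildcards might look like:

User-agent: Googlebot
Disallow: /dir1/*.cgi

User-agent: Slurp
Disallow: /dir1/*.cgi

that would block the script and the urls it spawns while leaving the plain html in /dir1/dir2/ crawlable.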

Vulcan315 (10+ Year Member)
Msg#: 3238252 posted 2:13 pm on Feb 1, 2007 (gmt 0)

Unfortunately, the shopping software I have has to stay one directory above the html it uses to spawn the files. I've tried all sorts of rewrites over the last eight years, and nothing works without breaking the SSL.

I was hoping to come up with a way to block the bots from /dir1/ but still point them to /dir1/dir2/ via my sitemap (where they're already going, anyway). Since the bots already crawl /dir1/dir2/, I just want to block them from the cgi in /dir1/.

Would using:

Disallow: /dir1/

...in the robots.txt file prohibit Google from going to files in /dir1/dir2/, if I have links in my sitemap pointing specifically to them?

Dave

goodroi (WebmasterWorld Administrator, Top Contributor of All Time, 10+ Year Member, Top Contributor of the Month)
Msg#: 3238252 posted 2:02 pm on Feb 2, 2007 (gmt 0)

User-agent: *
Disallow: /dir1/
Allow: /dir1/dir2/

that should work but keep a watchful eye on it. search engine bots are built for quantity, not quality. in other words, the more finesse you try to apply, the more likely something is to go wrong. good luck
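a quick way to sanity-check a rule set like this offline is Python's standard-library robots.txt parser. a minimal sketch follows (mysite.com stands in for the real domain); one caveat is that urllib.robotparser applies rules in file order with first match winning, while Google picks the longest matching path, so the Allow line is placed first here to get the intended result from both styles of parser:

from urllib import robotparser

# Build the rule set in memory instead of fetching a live robots.txt.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Allow: /dir1/dir2/",  # listed first: this parser stops at the first match
    "Disallow: /dir1/",
])

# The deeper directory stays crawlable...
print(rp.can_fetch("*", "http://mysite.com/dir1/dir2/page.html"))  # True
# ...while the cgi level above it is blocked (shop.cgi is a hypothetical name).
print(rp.can_fetch("*", "http://mysite.com/dir1/shop.cgi"))  # False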

Vulcan315 (10+ Year Member)
Msg#: 3238252 posted 2:11 pm on Feb 6, 2007 (gmt 0)

"
User-agent: *
Disallow: /dir1/
Allow: /dir1/dir2/
that should work but keep a watchful eye on it. search engine bots are made for quantity and not quality. in other words when the more finesse you try to do the more likely something will go wrong. good luck"

Thanks... I think what I was actually trying to ask was: If I use

Disallow: /dir1/

Will the engines go to the files in /dir1/dir2/ if I point them there via my sitemap?

Thanks for the help!
Dave

jdMorgan (WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member)
Msg#: 3238252 posted 3:08 pm on Feb 6, 2007 (gmt 0)

If I use
Disallow: /dir1/

Will the engines go to the files in /dir1/dir2/ if I point them there via my sitemap?

No. This will disallow any URL-path that *begins with* /dir1/, which includes everything in /dir1/dir2/; a sitemap entry cannot override a robots.txt Disallow.

The code posted above should only be used with Yahoo and Google, and with any other robots that you have verified accept the non-standard "Allow" directive. "Allow" is an unofficial extension to the Robots Standard and is not supported by most robots.

User-agent: Googlebot
Disallow: /dir1/
Allow: /dir1/dir2/

User-agent: Slurp
Disallow: /dir1/
Allow: /dir1/dir2/


is the safest way to code it, although you might also wish to test

User-agent: Googlebot
User-agent: Slurp
Disallow: /dir1/
Allow: /dir1/dir2/


The problem is that some robots may not understand the multiple-user-agent record, and may become confused and either fetch/index nothing or everything.
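
The prefix matching is easy to see with Python's standard-library parser; a minimal sketch (again with the thread's placeholder domain) showing that a plain Disallow blocks the deeper directory too, and that no sitemap entry can override it:

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /dir1/",
])

# /dir1/dir2/... begins with /dir1/, so it is disallowed as well;
# a sitemap link to this page would not change the answer.
print(rp.can_fetch("*", "http://mysite.com/dir1/dir2/page.html"))  # False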

Jim

goodroi (WebmasterWorld Administrator, Top Contributor of All Time, 10+ Year Member, Top Contributor of the Month)
Msg#: 3238252 posted 3:52 pm on Feb 6, 2007 (gmt 0)

Yeah, what Jim said :)

each robot (even among the big search engines) behaves differently. some can handle complex instructions and others fall to pieces when they encounter an unusual line in robots.txt. i strongly recommend that you monitor the crawling of your site anytime you touch the robots.txt. i know of many sites that made casual changes to robots.txt and ended up not being crawled or included in the search engine index until their robots.txt was fixed.

be careful, keep it simple and good luck.
