
If a dir is disallowed, will a sitemap in a subdir be read?

Sitemaps located in non-content directories


badbadmonkey

10:46 am on Nov 17, 2008 (gmt 0)

Right, this is probably a stupid question, but it's one of those I couldn't find an answer to...

I've got a bunch of automatically generated sitemaps being put in /wiki/sitemaps/. The problem is that /wiki/ is not a content directory; rather, the content from its scripts is presented under a virtual dir, /kb/. /wiki/ is then disallowed in robots.txt to keep things clean.
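So the relevant part of robots.txt currently looks something like this (simplified):

User-agent: *
Disallow: /wiki/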

Will Google et al access and read the sitemaps in the /wiki/sitemaps/ dir?

If not, should I use Allow: on the subdir, or move the files somewhere else?

goodroi

5:33 pm on Nov 17, 2008 (gmt 0)

If you block a directory, Google will not access anything in that directory or its subdirectories... unless you include a line in robots.txt to allow it.

For example:

User-agent: *
Allow: /wiki/sitemaps/
Disallow: /wiki/

This would block example.com/wiki/ but still allow Google to access example.com/wiki/sitemaps/.

Don't forget you can test how Google would react to a robots.txt file by logging into Google Webmaster Central and visiting the tools section.

badbadmonkey

3:09 am on Nov 18, 2008 (gmt 0)

Right, thanks. The other problem I came across, though, was that sitemaps don't work in parallel subdirs, i.e. a sitemap must live in the same dir as, or a parent dir of, the URLs it lists. Annoying!

So a bit of mod_rewrite magic fixed that problem: /kb/sitemap.xml now actually serves the file at /wiki/sitemaps/sitemap.xml!
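For anyone curious, the rule is along these lines (a simplified sketch; exact paths and flags depend on the setup):

# in .htaccess at the document root
RewriteEngine On
RewriteRule ^kb/sitemap\.xml$ /wiki/sitemaps/sitemap.xml [L]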

Re-submitted to Google, 99% confident it'll be happy.

g1smd

4:26 pm on Jan 1, 2009 (gmt 0)

"Allow" is not standard syntax. It might not work.

janharders

6:00 pm on Jan 1, 2009 (gmt 0)

"Allow" is not standard syntax. It might not work.

I know it's not in the original document, but it's in the 1997 RFC - have you ever come across a legitimate bot that doesn't support "Allow"?

jimbeetle

8:23 pm on Jan 1, 2009 (gmt 0)

have you ever come across a legitimate bot that doesn't support "Allow"?

Sure, how 'bout archive.org's ia_archiver to name just one?

It's a comparatively recent development that the major bots started to support Allow directives. Until Google started to recognize Allow and implemented wildcards in pattern matching (and the other majors followed), maybe about two years ago, the *only* supported directive was Disallow. There are still legitimate bots out there that don't recognize Allow directives and wildcard pattern matching. Use either, but then don't be surprised if your "blocked" content makes it into an index somewhere.
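If you do use Allow, one way to hedge (a sketch, reusing the directories from this thread) is to give the bots you know support it their own record, and keep the catch-all record to plain Disallow:

User-agent: Googlebot
Allow: /wiki/sitemaps/
Disallow: /wiki/

User-agent: *
Disallow: /wiki/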

janharders

8:45 pm on Jan 1, 2009 (gmt 0)

Thanks, I wasn't aware of that. I had never really thought about it, and when I finally had the opportunity to use Allow, I only checked Google and the other majors, and it was fine.
Though I probably also didn't run into any trouble, because we usually Disallow everyone and then specifically Allow the big ones in, saving on crawler traffic from obsolete search engines that never send us any visitors (at least in the German market; that may be different in the rest of the world, where Google is not above 90% market share).
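Roughly like this, to sketch it (the bot names here are just examples; our real list is longer):

# an empty Disallow value means the bot may crawl everything
User-agent: Googlebot
Disallow:

User-agent: Slurp
Disallow:

User-agent: *
Disallow: /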
 
