I've got a bunch of automatically generated sitemaps being put in /wiki/sitemaps/. The problem is that /wiki/ is not a content directory; rather, the content from its scripts is served under a virtual dir, /kb/. /wiki/ is then disallowed in robots.txt to keep things clean.
Will Google et al access and read the sitemaps in the /wiki/sitemaps/ dir?
If not, should I use Allow: on the subdir, or move the files somewhere else?
For example
User-agent: *
Allow: /wiki/sitemaps/
Disallow: /wiki/
This would block example.com/wiki/ but still allow Google to access example.com/wiki/sitemaps/.
Don't forget you can test how Google would react to a robots.txt by logging into Google Webmaster Central and visiting their tool section.
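For a quick local sanity check before trying Google's tool, here is a minimal sketch using Python's standard urllib.robotparser. The helper name can_crawl and the example URLs are made up for illustration. Note that Python's parser applies the first matching rule, whereas Google picks the most specific (longest) match; for this particular file both approaches should give the same answer, but this is not a substitute for testing against the real crawler.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt rules proposed above.
ROBOTS_TXT = """\
User-agent: *
Allow: /wiki/sitemaps/
Disallow: /wiki/
"""


def can_crawl(url, robots_txt=ROBOTS_TXT, agent="*"):
    """Hypothetical helper: ask Python's parser whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # Example URLs are assumptions, not taken from a real site.
    for url in (
        "http://example.com/wiki/sitemaps/sitemap.xml",
        "http://example.com/wiki/index.php",
        "http://example.com/kb/Some_Page",
    ):
        print(url, "->", "allowed" if can_crawl(url) else "blocked")
```

Because the Allow line comes before the Disallow line, a first-match parser like this one also lets the sitemap through; reversing the two lines could change the result for such parsers.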
So a bit of mod_rewrite magic fixed that problem - /kb/sitemap.xml now actually serves the file at /wiki/sitemaps/sitemap.xml!
Re-submitted to Google, 99% confident it'll be happy.
Have you ever come across a legitimate bot that doesn't support "Allow"?
It's a comparatively recent development that the major bots started to support Allow directives. Until Google started to recognize Allow and implemented wildcards in pattern matching (and the other majors followed), maybe about two years ago, the *only* supported directive was Disallow. There are still legitimate bots out there that don't recognize Allow directives and wildcard pattern matching. Use either, but then don't be surprised if your "blocked" content makes it into an index somewhere.
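To make that concrete, here is a rough sketch of how a Disallow-only bot might read the same file. This is a toy model, not any real crawler's code: legacy_allows is a hypothetical function, it ignores User-agent grouping entirely, and the "Disallow: /*.pdf$" line is an invented wildcard rule added just for illustration.

```python
def legacy_allows(path, robots_lines):
    """Toy model of an older bot: only Disallow is understood,
    and values are treated as literal path prefixes (no wildcards)."""
    for line in robots_lines:
        field, _, value = line.partition(":")
        if field.strip().lower() != "disallow":
            continue  # Allow and everything else is silently ignored
        prefix = value.strip()
        if prefix and path.startswith(prefix):
            return False
    return True


RULES = [
    "User-agent: *",
    "Allow: /wiki/sitemaps/",
    "Disallow: /wiki/",
    "Disallow: /*.pdf$",  # hypothetical wildcard rule, for illustration only
]

if __name__ == "__main__":
    # Allow is ignored, so the sitemap is caught by "Disallow: /wiki/".
    print(legacy_allows("/wiki/sitemaps/sitemap.xml", RULES))
    # The wildcard is read as a literal prefix, so it never matches.
    print(legacy_allows("/docs/manual.pdf", RULES))
```

Under this model the sitemap is treated as blocked because the Allow line is ignored, while the PDF that the wildcard was meant to block stays crawlable because the pattern is compared as a literal prefix.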