| 5:33 pm on Nov 17, 2008 (gmt 0)|
If you block a directory, Google will not access anything in that directory or its subdirectories ... unless you include a line in the robots.txt to allow it.
This would block example.com/wiki/ but still allow Google to access example.com/wiki/sitemaps/.
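Something like this, for instance (a minimal sketch using the paths from the example; the more specific Allow line takes precedence for bots that support it):

User-agent: *
Disallow: /wiki/
Allow: /wiki/sitemaps/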
Don't forget you can test how Google would react to a robots.txt by logging into Google Webmaster Central and visiting their tool section.
| 3:09 am on Nov 18, 2008 (gmt 0)|
Right, thanks. The other problem I came across, though, was that sitemaps don't work across parallel subdirectories, i.e. a sitemap must live in the same directory as the URLs it lists, or in a parent directory. Annoying!
So a bit of mod_rewrite magic fixed that problem - /kb/sitemap.xml now actually serves the file at /wiki/sitemaps/sitemap.xml!
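Roughly like this in .htaccess (a sketch; the exact pattern and flags will depend on your setup):

RewriteEngine On
# Internally map the /kb/ sitemap URL to the file stored under /wiki/sitemaps/
RewriteRule ^kb/sitemap\.xml$ /wiki/sitemaps/sitemap.xml [L]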
Re-submitted to Google, 99% confident it'll be happy.
| 4:26 pm on Jan 1, 2009 (gmt 0)|
"Allow" is not standard syntax. It might not work.
| 6:00 pm on Jan 1, 2009 (gmt 0)|
|"Allow" is not standard syntax. It might not work. |
I know it's not in the original document, but it's in the 1997 Internet-Draft - have you ever come across a legitimate bot that doesn't support "Allow"?
| 8:23 pm on Jan 1, 2009 (gmt 0)|
|have you ever come across a legitimate bot that doesn't support "Allow"? |
Sure, how 'bout archive.org's ia_archiver to name just one?
It's a comparatively recent development that the major bots started to support Allow directives. Until Google started to recognize Allow and implemented wildcards in pattern matching (and the other majors followed), maybe about two years ago, the *only* supported directive was Disallow. There are still legitimate bots out there that don't recognize Allow directives and wildcard pattern matching. Use either, but then don't be surprised if your "blocked" content makes it into an index somewhere.
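For example, a wildcard rule like the one below is understood by Google and the other majors, but a Disallow-only bot will read it as a literal path prefix that matches nothing and may crawl the "blocked" URLs anyway (the sessionid parameter is just an illustration):

User-agent: *
Disallow: /*?sessionid=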
| 8:45 pm on Jan 1, 2009 (gmt 0)|
Thanks, I wasn't aware of that. I had never really thought about it, and when I finally had the opportunity to use Allow, I only checked Google and the other majors and it was fine.
Though I probably also didn't get into any trouble because we usually Disallow everyone and then specifically Allow the big ones in, saving on traffic from obsolete search engines that never send us any visitors (at least in the German market; that may be different in the rest of the world, where Google is not above 90% market share).
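That approach can also be done without the Allow directive at all, which sidesteps the compatibility issue, since a bot follows the most specific User-agent record that matches it. A sketch of the pattern (Googlebot stands in for "the big ones"; an empty Disallow means "crawl everything"):

User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /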