Forum Moderators: Robert Charlton & goodroi


Webmaster Tools Sitemap Error - 'URLs Roboted Out'

         

doughayman

12:34 pm on May 10, 2008 (gmt 0)

10+ Year Member



Hi,

I just received a new sitemap warning, and wanted to bounce this off the board:

I am running an "old" Windows-based webserver, which (obviously) cannot discriminate between upper- and lowercase filenames. Some of my sites have been in existence for over 11 years, and in my early stupidity I created some filenames with a mixture of case:

e.g., FileName.htm

Well, needless to say I got wrapped up in all sorts of canonical issues over the years, where Google was indexing me for both:

FileName.htm and
filename.htm

This resulted in a duplicate-content penalty at times.

Due to the help of the members on this board, and given my old webserver technology (which I am locked into for a variety of reasons), I was able to combat this problem by entering "Disallow" clauses in my robots.txt file. For example, to eliminate the spidering and indexing of the "filename.htm" file above, I would add a clause to my robots.txt file similar to:

Disallow: /folder/filename.htm

Adding all the necessary files to robots.txt has been an arduous task, but it has seemed to clear up the canonical/duplication penalty problem.
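For anyone following along, the full robots.txt pattern described above would look roughly like this (paths are illustrative, not my actual files):

```
User-agent: *
# Block the unintended lowercase duplicate from being crawled...
Disallow: /folder/filename.htm
# ...while the canonical mixed-case URL, /folder/FileName.htm,
# has no Disallow entry and stays crawlable.
```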

Now today, I have received, for the very first time, a Google Sitemap warning for "filename.htm". The error verbatim is:

==================================================================
Sitemap errors and warnings

Line Status

URLS ROBOTED OUT

When we tested a sample of the URLs from your Sitemap, we found that the site's robots.txt file was blocking access to some of the URLs. If you don't intend to block some of the URLs contained in the Sitemap, please use our robots.txt analysis tool to verify that the URLs you submitted in your Sitemap are accessible by Googlebot. All accessible URLs will still be submitted.
==================================================================

My sitemap references this file as "FileName.htm" (i.e., with the mixture of case), which my robots.txt file does not include a "Disallow" entry for. Does this new error above allude to the fact that Google is no longer discriminating between upper and lowercase filenames?

Any thoughts on this ?

tedster

9:04 pm on May 10, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



It's extremely unlikely that Google (or any search engine) will stop being case sensitive. The W3C specs have always said that the path portion of a URL is case-sensitive, and that hasn't changed. IMO, Google will not move away from web standards that they've always supported.

However, Google has been trying to warn webmasters about common errors that are due to default server configurations. It's quite possible that they've begun an initiative in the case-sensitivity area and their effort has added to this tangle for you. For example, there may well be some new crawling logic that says "this is a Windows Server, so don't assume that urls in robots.txt and/or Sitemaps are case sensitive."

Even the very latest Windows servers are case-insensitive, so upgrading your server would not be a complete "fix". What it sounds like you need is the paid third-party module, ISAPI_Rewrite - its minimum requirement is Windows 2000 running IIS 5. If your server is older than that, I'd think about an upgrade.
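For reference, a case-normalizing 301 redirect with ISAPI_Rewrite 3 (which uses Apache mod_rewrite-compatible syntax) might look something like this - a sketch only, not tested against your setup:

```apache
# Sketch of a case-lowering redirect, assuming ISAPI_Rewrite 3
# (mod_rewrite-compatible) in the site's .htaccess/httpd.conf.
RewriteEngine on

# Internal map that lowercases its input.
RewriteMap lc int:tolower

# If the requested path contains any uppercase letter...
RewriteCond %{REQUEST_URI} [A-Z]

# ...301-redirect to the all-lowercase version of the same path.
RewriteRule (.*) ${lc:$1} [R=301,L]
```

With a rule like this in place, both spellings resolve to a single canonical lowercase URL, and the robots.txt Disallow workaround becomes unnecessary.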

The alternative is observing the situation a bit longer and gathering more information. If you aren't seeing new indexing or ranking troubles, that is an option. If you go that way, I'd suggest these practices:

1. Do a complete inventory of affected urls - how many there are can be a big factor in creating a fix.

2. Monitor Google's indexing and ranking for the urls on your inventory

3. Monitor your server logs for all googlebot requests. That way you can match the GWT error messages with specific events

4. Validate your robots.txt file with Google's own tool in Webmaster Tools

5. Monitor your Webmaster Tools messages quite closely

6. Keep a changelog for any significant actions you take - including when you generate a new Sitemap.

7. You might consider just dropping the Sitemap and letting Google crawl your site "naturally".
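As a rough illustration of step 3 above, a small script can pull the googlebot requests out of an access log and flag any mixed-case paths - a sketch that assumes a combined-format log; the sample lines and IPs below are made up:

```python
import re

# Matches the request and status fields of a combined-format log line.
LOG_LINE = re.compile(r'"(?:GET|HEAD) (\S+) HTTP/[\d.]+" (\d{3})')

def googlebot_requests(lines):
    """Yield (path, status) for every Googlebot request in an access log."""
    for line in lines:
        if "Googlebot" not in line:
            continue
        m = LOG_LINE.search(line)
        if m:
            yield m.group(1), int(m.group(2))

def mixed_case(paths):
    """Return the paths that contain at least one uppercase letter."""
    return [p for p in paths if any(c.isupper() for c in p)]

# Hypothetical sample log lines, just to show the mechanics.
sample = [
    '66.249.66.1 - - [10/May/2008:12:00:00 +0000] "GET /folder/FileName.htm HTTP/1.1" 200 1234 "-" "Googlebot/2.1"',
    '66.249.66.1 - - [10/May/2008:12:00:05 +0000] "GET /folder/filename.htm HTTP/1.1" 200 1234 "-" "Googlebot/2.1"',
    '10.0.0.1 - - [10/May/2008:12:00:09 +0000] "GET /index.htm HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
]

hits = list(googlebot_requests(sample))
print(hits)
print(mixed_case([p for p, _ in hits]))  # → ['/folder/FileName.htm']
```

Run against the real logs, any mixed-case path that googlebot is still fetching can then be matched up against the GWT error messages and the Sitemap.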

You've got a challenge on a Windows server. Its native case-insensitivity will obstruct the redirects that you would naturally hope to use. So you might consider changing all your mixed-case urls to something completely new, and always use lower case for the new character strings. If you do this, I'd say let the old urls go 404 - forget using redirects and trying to untangle the case-sensitive ball of string. Simplify, simplify, simplify.

None of this is going to be optimal. Hosting the site on a server where you can install ISAPI Rewrite is the best long-term fix. Anything less than that, and you'll probably be in and out of indexing trouble and pile up a number of kludge style fixes to maintain - and they'll only work short-term.

doughayman

10:07 pm on May 10, 2008 (gmt 0)

10+ Year Member



Thanks for your response, Tedster. I agree with all you say. I run O'Reilly & Associates Webserver (not a version of IIS). I have a lot of code written specifically for that server, and cannot currently afford the time for a port and re-write. I've been controlling things (rather well) via robots.txt entries. But you are right - it is a kludge, and this recent development worries me quite a bit.

Thanks for your input.