Forum Moderators: Robert Charlton & goodroi

Message Too Old, No Replies

Capitalization of Pages creates index problems

         

doughayman

6:11 pm on Feb 18, 2007 (gmt 0)

10+ Year Member



Hello all,

I seem to have a capitalization issue with Googlebot, and I'm looking for suggestions to remedy it.

I have a subdirectory architecture under my domain, with a multitude of sites hosted beneath it (this has been in existence for 10 years or so). The structure looks like:

www.domain.com/sub1/HomePageName1.htm
www.domain.com/sub2/HomePageName2.htm
.
.
etc.

Although Googlebot usually picks up these home-page names with the appropriate capitalization (e.g., HomePageName1.htm), on occasion Googlebot fetches the page without capitalization (e.g., homepagename1.htm). Although I do NOT have a separate page in this folder without the capitalization, Googlebot believes it finds one, since I see a return code of 200.

Whenever Googlebot fetches the page without capitalization (maybe once every two months or so), the ranking for this page gets severely impacted AND the PageRank for HomePageName1.htm drops from its normal PR3 to PR0.

I could rename all such pages to be devoid of capitalization, but I'm afraid of screwing up my rankings in the indices. Does anyone have a simple solution to remedy this situation, since Google apparently is case-sensitive when it comes to naming conventions?

theBear

9:50 pm on Feb 18, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



"but I'm afraid of screwing up my rankings in the indices"

Seems that some of them are already screwed up from time to time as things currently stand.

Your server is lying to the bots.

The solution is to fix the server. I don't deal with Windows-based systems, and your problem sounds like your site is hosted on one.

I'd visit the Microsoft server forum [webmasterworld.com] and look for mentions of ISAPI rewrite and capitalization of URLs.

A WebmasterWorld wide search on duplicate content and canonicalization might also help.

MThiessen

10:50 pm on Feb 18, 2007 (gmt 0)

10+ Year Member



Are you sure that this is crawled from your site?: www.domain.com/sub1/homepagename1.htm

It could be someone else linking to you from their site, but their web-creation software is removing the caps. So someone trying to link to your www.domain.com/sub1/HomePageName1.htm page may accidentally have the broken link www.domain.com/sub1/homepagename1.htm instead.

Perhaps it was hand edited in there and the writer failed to capitalize.

The point is, could Googlebot's spidering of this person's page have triggered the bot to crawl that URL on your site, and is that the page showing up?

In other words, under this theory you have no error at all; it is someone else's error (in linking, probably an amateur site) causing this problem.

Just a guess...

helpnow

11:18 pm on Feb 18, 2007 (gmt 0)

10+ Year Member



Just use your httpd.conf file to rewrite the URLs as they come in, using RewriteRule to map the URLs from lowercase to MixedCase.

We have the same issue, made this fix, and all is well.
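For readers on Apache, a minimal sketch of what that might look like in httpd.conf, with one explicit rule per page (the paths below are the hypothetical examples from this thread; a fully general lowercase-to-MixedCase mapping would need a RewriteMap or similar):

```apache
# Sketch: 301-redirect known all-lowercase variants back to their
# canonical mixed-case filenames. Paths are hypothetical examples.
RewriteEngine On
RewriteRule ^/sub1/homepagename1\.htm$ /sub1/HomePageName1.htm [R=301,L]
RewriteRule ^/sub2/homepagename2\.htm$ /sub2/HomePageName2.htm [R=301,L]
```

(In a per-directory .htaccess file the leading slash would be dropped.)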

doughayman

11:35 pm on Feb 18, 2007 (gmt 0)

10+ Year Member




Are you sure that this is crawled from your site?: www.domain.com/sub1/homepagename1.htm

Yes, this lowercase version of my page was clearly crawled by Googlebot, and it was crawled successfully (status 200). This leads me to believe that there are now two pages with duplicate content, tagged as different Doc IDs, in Google. What is scary is that I do NOT have any lowercase references to this page on my site. This implies an external link may have triggered the crawl, and if that is the case, then an external link with deliberate capitalization changes could effectively trigger a duplicate-content penalty on me? Ouch!


Just use your httpd.conf file to rewrite the URLs as they come in, and Rewriterule the URLs from lowercase into MixedCase.

OK guys, here is where I am concerned. I am running on a Windows 2000 platform, using an old (but stable) webserver -- O'Reilly and Associates Website Professional V1.1H. This webserver is no longer supported by the company.

Anyone have any solutions for me here? I've already tried doing an explicit redirect from all-lowercase-based filenames to my desired capitalization nomenclature in the server ADMIN, and it didn't appear to do the appropriate redirect.

Unfortunately, on Windows 2000, I cannot create a separate file of the same name in all lowercase and put an explicit 301 redirect in that file.

doughayman

12:19 am on Feb 19, 2007 (gmt 0)

10+ Year Member



Is creating a robots.txt file a possibility? Are its entries case-sensitive?

Drew_Black

3:30 am on Feb 19, 2007 (gmt 0)

10+ Year Member



I'm not familiar with the O'Reilly web server but there are ways to work around this if running IIS.

You can change the script mapping of .htm files so they are parsed by the ASP script engine. Then edit the page to include a bit of ASP that inspects the URL to see if it matches the correct capitalization. If it's not properly capitalized, clear the buffer, set Response.Status to "301 Moved Permanently", and add the header Response.AddHeader "Location", "http://thecorrecturl".
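The decision logic described here can be sketched as follows, in Python for illustration (the original suggestion is classic ASP on IIS; the canonical-path table is a made-up example):

```python
# Sketch of the case-check-and-301 idea above (illustrative Python;
# the real fix would live in an ASP page on IIS).
# CANONICAL_PATHS is a hypothetical lookup table.

CANONICAL_PATHS = {
    "/sub1/homepagename1.htm": "/sub1/HomePageName1.htm",
    "/sub2/homepagename2.htm": "/sub2/HomePageName2.htm",
}

def check_case(requested_path):
    """Return (status, location) for an incoming request path."""
    canonical = CANONICAL_PATHS.get(requested_path.lower())
    if canonical is None:
        return (404, None)   # page does not exist under any casing
    if requested_path == canonical:
        return (200, None)   # correct capitalization: serve normally
    # Wrong capitalization: send a permanent redirect to the canonical URL.
    return (301, canonical)
```

This way a spider requesting the lowercase variant sees one page, at one address, instead of two duplicates.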

theBear

3:37 am on Feb 19, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The robots.txt entries are indeed case-sensitive, so you can block the ones that don't belong.

[edited by: theBear at 3:37 am (utc) on Feb. 19, 2007]
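For example, a robots.txt along these lines (using the hypothetical filenames from earlier in the thread) blocks only the all-lowercase variants, since Disallow paths match case-sensitively:

```
User-agent: *
# These lines block only the all-lowercase variants;
# the canonical mixed-case pages remain crawlable.
Disallow: /sub1/homepagename1.htm
Disallow: /sub2/homepagename2.htm
```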

vincevincevince

7:43 am on Feb 19, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



First - move to a more modern web server. You can install a perfectly serviceable Apache installation on W2K Server.

Second - analyse all your logs for the lowercase name, looking for entries with something in the referrer field. If you find one, you know which external site is giving you lowercase links.
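That log check can be sketched like this, assuming NCSA combined log format (the target filename is a hypothetical example from this thread):

```python
import re

# Sketch of the log check suggested above: find requests for the
# all-lowercase variant that carry a non-empty referrer, which
# reveals which external site links with the wrong case.
# Assumes NCSA combined log format; the path is a hypothetical example.

LINE_RE = re.compile(r'"(?:GET|HEAD) (\S+)[^"]*" \d+ \S+ "([^"]*)"')

def lowercase_hits(log_lines, lowercase_path="/sub1/homepagename1.htm"):
    """Yield (path, referrer) for lowercase-variant hits with a referrer."""
    for line in log_lines:
        m = LINE_RE.search(line)
        if not m:
            continue
        path, referrer = m.groups()
        if path == lowercase_path and referrer not in ("", "-"):
            yield path, referrer
```

Direct bot fetches carry no referrer, but any hit that does carry one points straight at the page with the miscased link.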

ramachandra

10:55 am on Feb 19, 2007 (gmt 0)

10+ Year Member



One of my sites, running on IIS, had the same problem: Google indexed one page under two URLs, like

www.mydomain.com/redwidget.html
www.mydomain.com/RedWidget.html

This happened because of our own mistake: on one of our internal pages we had accidentally written the link with the page name in uppercase. IIS serves both spellings, so Google cached the same page under both and treated them as two different pages.

We changed the link to lowercase; now the uppercase cached page shows neither in the regular index nor in supplemental, and only the lowercase page is in the regular index.

doughayman

2:20 pm on Feb 19, 2007 (gmt 0)

10+ Year Member



Thanks for the feedback, but moving to another web server is a non-starter for me, due to certain technical issues. Also, I do NOT have any of the lowercase references on my server; obviously, Google picked up the all-lowercase version of the URL from an external link.

Anyone have any other solutions? It really appears that external users can affect a Windows-based website's ranking by linking to alternate case permutations of its URLs. I'm amazed that Google penalizes for this.

theBear

3:43 pm on Feb 19, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



" I'm amazed that Google penalizes for this. "

I'm not.

Your server lied and in your case Google picked the "wrong version" of the duplicate content to rank relative to other pages returned for some search phrases.

This is what a massive number of WebmasterWorld posts have boiled down to for a very long time now.

Some folks choose to ignore it and hope it will go away, but it keeps coming back and biting them.

doughayman

4:22 pm on Feb 19, 2007 (gmt 0)

10+ Year Member



Bear,

I respectfully beg to differ. Google's rules on this clearly cater to *nix-based systems, which they are deployed and developed on. Ignoring aspects of other OSes (e.g., Windows, which is not case-sensitive) indicates to me that Google favors *nix and is not in step with Windows-based servers, which are becoming more and more prevalent by the day.

It is certainly not that the Windows server is lying; rather, that is the best it can do given the circumstances of its underlying file system.

Further, this violates Google's dictum that an external source can never affect the rankings of a site. This is a clear case where an external source linked to my site and caused a duplicate-content penalty. I'm not saying that this instance was malicious, but it surely leads one to believe that a malicious competitor could exploit this Windows/Google anomaly (note that I am sharing the blame here).

jimbeetle

5:14 pm on Feb 19, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



It appears that your server supports some form of content negotiation. It might be best to read through some of the Apache documentation [httpd.apache.org] to first get an idea of what's what, then check your server documentation to see if there are settings that you can tweak.

doughayman

5:50 pm on Feb 19, 2007 (gmt 0)

10+ Year Member



Jim Beetle,

Thanks, but I've pored over it. My server does allow for automatic redirects (302s), but unfortunately it does not take case sensitivity into consideration.

I'm going the route of creating "Disallow" entries in a robots.txt file, blocking the spidering of the all-lowercase variants of my mixed-case filenames.

This surely will not fix the underlying problem, but it should hopefully eliminate the problems caused by all-lowercase attempts to index.

Still, I need a much better solution than this.....

theBear

5:56 pm on Feb 19, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



doughayman,

Google's rules follow the naming conventions for URLs.

I would suggest that you become familiar with such things; failure to do so will lead to other unwanted results for you.

encyclo

6:04 pm on Feb 19, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



What exactly is keeping you on a Windows 2000 server? Are you using ASP or similar?

IIS simply isn't good enough as a hosting environment, as it cannot (due to intrinsic deficiencies in the OS) handle mixed-case URIs according to the specifications. This will cause problems such as you have described; it is not Google's bug (they respect the specifications), but Microsoft's.

The best solution is to move the site to a real hosting environment which handles URIs correctly - this means a Linux or similar Unix-style server, usually running Apache. Obviously, if you are tied to Windows via the server-side technology, then your options are more limited.

sailorjwd

6:18 pm on Feb 19, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If all else fails, you might include a snippet of code to check the requested URL and, if it isn't capitalized correctly, add a noindex meta tag.

I do this for a different purpose: for links coming in with parameters vs. no parameters. I always noindex if there are parameters.

but then again, I know nothing.
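This fallback can be sketched like this (Python for illustration; the canonical path is a hypothetical example, and in practice the check would run server-side in whatever scripting the server supports):

```python
# Sketch of the fallback above: when the request arrives under the
# wrong capitalization, emit a noindex meta tag instead of redirecting.
# The canonical path is a hypothetical example.

def robots_meta(requested_path, canonical_path="/sub1/HomePageName1.htm"):
    """Return the robots meta tag to emit for this request."""
    if requested_path == canonical_path:
        return '<meta name="robots" content="index,follow">'
    # Wrong-case duplicate: ask spiders not to index this copy.
    return '<meta name="robots" content="noindex,follow">'
```

Unlike a 301, this serves the duplicate but keeps it out of the index, which avoids the two-Doc-ID problem described earlier in the thread.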

doughayman

8:41 pm on Feb 19, 2007 (gmt 0)

10+ Year Member




Google's rules follow the naming conventions for URLs.

I would suggest that you become familiar with such things; failure to do so will lead to other unwanted results for you.

Well, I wasn't aware of official naming conventions associated with URLs. What organization has furnished such rules? I come from a programming background, where names are made more readable, from a usability standpoint, by appropriate capitalization and punctuation (e.g., underscores). That's where I am coming from.

If I haven't adhered to this "standard" and my Microsoft server doesn't allow me to, what options do I have? As I see it, and as others have mentioned already:

1) Move to a web-server platform that distinguishes capitalized from non-capitalized URLs. This will require a new server platform and a rewrite of my custom CGI scripts (which make use of internal O'Reilly and Associates Website Professional V1.1h functions). This is achievable, but will take some work.

2) Use the case-sensitivity of robots.txt as a band-aid, although this will only be foolproof if I disallow ALL the capitalization permutations of my filenames (virtually impossible, given the length of the filenames I have used).

3) Change all my URL references to lowercase. This would be achievable, albeit a lengthy and dangerous task. I would have to ensure that all "local" references are lowercase; I would have to "block" in robots.txt all the mixed-case references that are currently indexed and providing me with rankings; and I would lose the many established external links that refer to me via mixed-case references. I don't see this as a viable option at all, unless I'm starting from scratch, which I'm not willing to do.

This mixed-case "duplicate content" penalty has happened to me several times in the past, and my rankings always seem to return eventually (after a week or two), probably when the mixed-case URLs win out in the indices after being re-indexed. Hence, my thinking is that option #2 is best for me short-term, with option #1 as a longer-term goal.

Anyone have anything else to add to this "mess"?

Thanks to everyone who provided input, by the way.

theBear

9:14 pm on Feb 19, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You can start here: [w3.org...]

and you'll be pointed here:

[gbiv.com...]

for URIs.

The reason the effect is temporary is that Google tries to remove pages that aren't linked via a path from the root page of the site.

Such pages are called orphaned pages; Google eventually prunes them out of its index.

[edited by: theBear at 9:18 pm (utc) on Feb. 19, 2007]

doughayman

9:36 pm on Feb 19, 2007 (gmt 0)

10+ Year Member



Bear,

Thanks for the references, and for your input on this matter!

wingslevel

9:57 pm on Feb 19, 2007 (gmt 0)

10+ Year Member



I was having this problem too - even to the point of having two separate listings (one or both supplemental) in the Google index, like:

www.mydomain.com/blue_widgets.htm
www.mydomain.com/Blue_Widgets.htm

I now check the case on incoming requests and 301 case mismatches to the proper page. It took about a month, and now the problem is gone.

I got a little more case-sensitive myself when I switched my servers over to Linux from Windows - it's much less forgiving of case...

encyclo

2:50 am on Feb 20, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



See: [ietf.org...] - section 3.2.3 "URI Comparison" (dated June 1999).

Basically, the issue is that IIS chooses to ignore the strong recommendation (though not an outright obligation) in the specification that a URL should be treated case-sensitively after the scheme and host (domain) name, i.e. after the initial slash:

3.2.3 URI Comparison

When comparing two URIs to decide if they match or not, a client SHOULD use a case-sensitive octet-by-octet comparison of the entire URIs, with these exceptions:

- A port that is empty or not given is equivalent to the default port for that URI-reference;

- Comparisons of host names MUST be case-insensitive;

- Comparisons of scheme names MUST be case-insensitive;

(My emphasis.) Note the SHOULD, not MUST. Whilst IIS is not specifically broken according to a strict reading of the spec, the simple fact is that Google follows the RFC in full on this point, and IIS does not.

Google stores its index on Linux servers, which are naturally case-sensitive with file names. I wonder how MS Live (index stored on case-insensitive Windows servers) handles the issue?
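The comparison rules quoted above can be sketched in Python (a simplified illustration covering only scheme, host, port, and the case-sensitive remainder; the DEFAULT_PORTS table is a minimal assumption):

```python
from urllib.parse import urlsplit

# Sketch of the URI comparison rules quoted above: scheme and host
# compare case-insensitively, an absent port equals the scheme's
# default, and everything after the host compares case-sensitively.

DEFAULT_PORTS = {"http": 80, "https": 443}

def uris_match(a, b):
    """Compare two URIs per the quoted rules (simplified sketch)."""
    pa, pb = urlsplit(a), urlsplit(b)
    if pa.scheme.lower() != pb.scheme.lower():
        return False
    if (pa.hostname or "").lower() != (pb.hostname or "").lower():
        return False
    port_a = pa.port or DEFAULT_PORTS.get(pa.scheme.lower())
    port_b = pb.port or DEFAULT_PORTS.get(pb.scheme.lower())
    if port_a != port_b:
        return False
    # Path, query, and fragment are compared octet-by-octet.
    return (pa.path, pa.query, pa.fragment) == (pb.path, pb.query, pb.fragment)
```

So a lowercase path and a mixed-case path are different URIs, while case differences in the scheme or host are not.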

texasville

6:40 am on Feb 20, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Whew! I see serious trouble brewing here.
For instance: if you can access the same page via two different URLs, such as mysite.com/Page1.htm and mysite.com/page1.htm, you are asking for serious duplicate-content problems. You are actually hitting the same file, and that is why Googlebot is returning a 200 OK. Most hosts have installed ISAPI rewrite to cure this problem. It also means you can access this page as mysite.com/pAge1.htm, ad nauseam on the caps.
I had this problem and an entire site ended up in supplemental hell. I moved to an Apache server two months ago and 40 percent of the site has come out of supplemental. Some of those pages had been supplemental for over 18 months.
If you want to continue using mixed capitals you MUST move to Apache to avoid real problems. There you will not have a problem, as a mixed-capital page can only be accessed using the exact URL.