I have a 10+ year-old established domain hosted on an ancient Windows webserver platform (O'Reilly & Associates' WebSite, v1.1). I have a lot of custom code written for it, so it is not practical for me to port to IIS or another webserver.
Anyway, years ago I made the decision to use mixed case in my URLs (I know, I can hear the groans...). For example, my home page looks something like this:
www.domain.com/HomePage.htm
I have ranked high in Google for many keyword terms for the last 10 years. Googlebot regularly indexes my home page:
www.domain.com/HomePage.htm
On occasion (maybe quarterly), Googlebot fetches the all-lowercase version of my home page URL:
www.domain.com/homepage.htm
Whenever this happens, several days later I get hammered in the SERPs - probably a duplicate content penalty, with Google concluding that I have two identical pages, even though my webserver serves up the same page either way (Windows cannot distinguish between upper- and lowercase filenames).
Whenever this happens, about a week later my SERPs return. My guess is that Google figures out (perhaps in a different phase of filtering) that this is one and the same page, removes the "duplicate penalty", and all is restored to normal. Like I said, this happens about once a quarter.
Now, to alleviate this quarterly problem, I decided to create a ROBOTS.TXT file and disallow the all-lowercase URL from being spidered. In robots.txt, I did this via:
Disallow: /homepage.htm
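(For completeness: that directive sits under the usual wildcard user-agent record, so the whole file is roughly just:

User-agent: *
Disallow: /homepage.htm

Note that robots.txt path matching is case-sensitive, which is exactly why this traps only the lowercase spelling and leaves /HomePage.htm crawlable.)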
2 weeks ago, this lowercase entry was finally fetched by Googlebot (it showed up on the Google Sitemaps diagnostics page, under "URLs restricted by robots.txt"). Several days later, my rankings tanked incredibly - to the point that I'm at about ranking #800 for terms I normally ranked in the top 10 for - AND after waiting my normal week (plus a second week), these rankings have not returned.
My simplistic thought here is that since I am no longer being hit with the duplicate content penalty (by virtue of trapping the lowercase URL with robots.txt), I am instead being hit with some other sort of penalty. My naive solution at this point is to restore things to where they once were (i.e., remove the robots.txt Disallow line), but that means I may have to wait 3 months to test this theory.
BTW, this page has had minimal content changes made to it over the last couple of years, so there is nothing else on my end that could have caused this problem.
Of course, a recent algo change could have caused this new behavior, but in my opinion the timing is too coincidental with Googlebot hitting my robots.txt entry.
Does anyone have any further thoughts on this?
[edited by: Asia_Expat at 4:57 am (utc) on June 4, 2007]
Hello all,
I'll try to be as succinct here as possible. I have a Windows-based webserver that, for obvious reasons, cannot discriminate between upper- and lowercase URLs (both the OS and the webserver are case-insensitive). For a multitude of reasons I cannot switch to a new technology platform, so please do not recommend this as a solution.
I have had this domain for about 11 years and have ranked well in the SERPs during that time. The main page of the domain has typically had a PR4, but has recently gone to a PR0.
Google, in accordance with the URL spec, has always fetched my main page in mixed case, which is how I designed it (I know, probably not the best decision, but it's what I settled on back in 1996).
My main page has the following format:
www.domain.com/HomePage.htm
On occasion (about 4 times per year), I notice that Googlebot fetches the all-lowercase version of this URL:
www.domain.com/homepage.htm
Whenever this happened, my rankings would tank for about a week, only to return to their previous levels. On occasion I would also notice that the mixed-case and all-lowercase versions were BOTH in the Google index. I attributed the week-long tank to some form of duplicate content penalty; Google would always ultimately sort it out, and everything would return to normal after about a week.
In an attempt to remedy this quarterly week-long tanking, and acting on some advice from this site, I enabled a ROBOTS.TXT file that specifically disallowed the all-lowercase URL from getting spidered.
Lo and behold, last month Googlebot attempted to spider my all-lowercase URL, and it was "trapped" by robots.txt (I verified this through the Google Sitemaps tools).
Since then, my (mixed-case) page has dropped from a PR4 to a PR0, and I have tanked (a la the -950 penalty) for most (but not all) two-, three-, and four-word phrases in the SERPs. The page is still in the Google index (it's not supplemental), but near the very bottom (-950).
The all-lowercase entry is no longer in the Google index, and the mixed-case version stands alone, but the page now has a PR0 and my rankings have not come back at all after several weeks. In 11 years, the tanking has never lasted this long, so I'm attributing this latest robots.txt change as the catalyst for the recent behavior.
Can anyone shed any light on what may have happened here, and perhaps make some recommendations on what to do next? Has anyone else here had a similar plight?
Thanks in advance!
[edited by: tedster at 12:35 pm (utc) on June 14, 2007]
When I use some of the public PageRank checkers available on the internet, the two case variants of my URL show different PageRank values:
- www.domain.com/HomePage.htm has a PR of 4.
- www.domain.com/homepage.htm has a PR of 0.
When I access the page in my IE browser, my server doesn't care about the case, but the page that gets rendered shows a PR of 0 - matching the all-lowercase version of the URL which, BTW, is not currently in the Google index.
Each page of your site can be reached through a massive number of alternative URLs - one for every possible combination of upper- and lowercase characters in the path.
You need to fix this so that only one URL directly returns content, and every other combination returns either a 404 or a 301 redirect to the canonical URL for that page.
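If your CGI layer can see the path exactly as the client requested it, the canonicalization can be done in a small front-end script. Here's a rough, untested sketch in Python - the REQUEST_URI variable, the domain, and the page map are all assumptions you'd adapt to your server:

#!/usr/bin/env python
# Rough sketch (not tested on O'Reilly WebSite): 301-redirect any
# non-canonical casing of a URL to its one canonical spelling.
# Assumes the server exposes the raw, case-preserved request path in
# a REQUEST_URI-style CGI variable; adjust the name for your platform.
import os
import sys

# Hypothetical map of lowercased path -> canonical spelling
CANONICAL = {
    "/homepage.htm": "/HomePage.htm",
}

def main():
    # Raw path as the client requested it, query string stripped
    requested = os.environ.get("REQUEST_URI", "").split("?")[0]
    canonical = CANONICAL.get(requested.lower(), requested)

    if requested != canonical:
        # Wrong casing: permanent redirect to the one true URL
        sys.stdout.write("Status: 301 Moved Permanently\r\n")
        sys.stdout.write("Location: http://www.domain.com%s\r\n\r\n" % canonical)
        return

    # Canonical casing: serve the file normally
    sys.stdout.write("Content-Type: text/html\r\n\r\n")
    with open(canonical.lstrip("/")) as fh:
        sys.stdout.write(fh.read())

if __name__ == "__main__":
    main()

That way Googlebot only ever sees one URL returning a 200 for each page, and every stray casing consolidates onto it.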
Actually, I'm running a webserver called O'Reilly "WebSite", which runs on my Windows 2000 platform. It is old technology, no longer supported, but I have a lot of proprietary customized CGI scripts (not directly portable to IIS) that keep me on this platform.
I understand the canonical issues, and I made a big mistake years ago with the mix of upper and lower case. However, deleting these pages, waiting 6 months, and then sticking to a single case is too large a risk and too big a hit.
I'm looking for other options as well.
Thanks!
As far as your other recommendation goes, I would need to check, but I'm not aware of a robots.txt noindex directive... I used the Disallow directive, and it did correctly block access to the all-lowercase entry, but it seems to have been the catalyst for my current state in Google. And I cannot put a meta noindex tag into the URL's file itself, since Windows doesn't let me discriminate between filenames that differ only in case (every casing maps to a single file).
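The only wrinkle I can think of: even though the file on disk is one and the same, the HTTP request itself still carries whatever casing the client asked for. So in principle a CGI wrapper could serve the same file for every casing but inject the meta tag only into the non-canonical spellings. A rough, untested sketch of that idea (REQUEST_URI and the filenames are assumptions for my setup):

#!/usr/bin/env python
# Rough sketch: serve one physical file for every casing of the URL,
# but add <meta name="robots" content="noindex"> only when the client
# requested a non-canonical spelling. REQUEST_URI is an assumption;
# the variable that carries the raw path depends on the server.
import os
import sys

CANONICAL_PATH = "/HomePage.htm"  # the one spelling that should be indexed

def main():
    requested = os.environ.get("REQUEST_URI", "").split("?")[0]

    # Same physical file regardless of the requested casing
    with open("HomePage.htm") as fh:
        html = fh.read()

    if requested != CANONICAL_PATH:
        # Non-canonical casing: tell robots not to index this variant
        html = html.replace(
            "<head>",
            '<head>\n<meta name="robots" content="noindex">',
            1,
        )

    sys.stdout.write("Content-Type: text/html\r\n\r\n")
    sys.stdout.write(html)

if __name__ == "__main__":
    main()

Whether that's wiser than the 301 approach suggested above, I don't know - the 301 at least consolidates everything onto one URL.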
Thanks for your concern and suggestions. All good thoughts.