Forum Moderators: Robert Charlton & goodroi


Robots.txt case-sensitivity detection causing loss in rankings?


doughayman

1:29 pm on Jun 3, 2007 (gmt 0)

10+ Year Member



Can anyone offer any advice or suggestions on the following:

I have a 10+ year-old established domain that is hosted on an ancient Windows webserver platform (O'Reilly & Associates WebSite V1.1). I have a lot of custom code written here, so it is not practical for me to port to IIS or another webserver.

Anyway, I decided years ago to use mixed case in my URLs (I know, I can hear the groans......). For example, my home page looks something like this:

www.domain.com/HomePage.htm

I have ranked high in Google for many keyword terms for the last 10 years. Googlebot regularly indexes my home page:

www.domain.com/HomePage.htm

On occasion (maybe quarterly), Googlebot has fetched the all-lowercase version of my home page URL:

www.domain.com/homepage.htm

Whenever this happens, several days later I get hammered in the SERPs (probably a duplicate content penalty - Google thinks I have 2 identical pages, even though my webserver serves up the same page either way - Windows cannot distinguish between upper and lowercase filenames).

Whenever this happens, about a week later my SERPs return - my guess is that Google figures out (perhaps in a different phase of filtering) that this is one and the same page, removes the "duplicate penalty", and all is restored to normal. Like I said, this happens about once a quarter (4 times a year).

Now, to alleviate this quarterly problem, I created a ROBOTS.TXT file and attempted to disallow the all-lowercase file entry from being spidered. In robots.txt, I did this via:

Disallow: /homepage.htm

2 weeks ago, this lowercase entry was finally fetched by Googlebot (it was noted on the Google Sitemaps Diagnostics page, under "URLs restricted by robots.txt"). Several days later, my rankings tanked incredibly - to the point that I'm at about position 800 for terms that I normally ranked in the Top 10 for, AND after waiting my normal week (plus a 2nd week), these rankings have not returned.

My simplistic thought here is that since I am no longer being hit with the duplicate content penalty (by virtue of trapping the lowercase page entry with robots.txt), I am instead being hit with some other sort of penalty. My naive solution at this point is to restore things to where they once were (i.e., remove the robots.txt Disallow statement), but that means I may have to wait 3 months to test this theory out.

BTW, this page has had minimal content changes made to it over the last couple of years, so there is nothing else on my end that could have caused this problem.

Of course, a recent algo change could have caused this new behavior, but in my opinion the timing is too coincidental with Googlebot's filtering on my robots.txt entry.

Does anyone have any further thoughts on this?

g1smd

11:58 pm on Jun 3, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



You are being hit because all versions of your homepage have been completely removed from the index.

As far as I know, the entries in robots.txt are not case sensitive.

If so, all case permutations of the names stated there are removed.

doughayman

12:59 am on Jun 4, 2007 (gmt 0)

10+ Year Member



g1smd,

I believe the entries are case-sensitive, as that is what the Robots.txt spec says. Also, my homepage (HomePage.htm) is still in the Google index - it hasn't been blown away at all.
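For what it's worth, the case-sensitivity is easy to demonstrate with Python's standard-library robots.txt parser (the domain and rules below are hypothetical stand-ins mirroring the setup in this thread):

```python
from urllib import robotparser

# Hypothetical rules mirroring the robots.txt described in this thread
rules = [
    "User-agent: *",
    "Disallow: /homepage.htm",
]

rp = robotparser.RobotFileParser()
rp.modified()  # mark the rules as "fetched" so can_fetch() will answer
rp.parse(rules)

# Path matching is a literal, case-sensitive prefix comparison:
print(rp.can_fetch("*", "http://www.example.com/HomePage.htm"))  # True
print(rp.can_fetch("*", "http://www.example.com/homepage.htm"))  # False
```

So a single lowercase Disallow line leaves the mixed-case URL fetchable, exactly as the spec describes.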

Asia_Expat

4:56 am on Jun 4, 2007 (gmt 0)

10+ Year Member



Robots.txt IS case-sensitive, and I got caught out by this. I fixed the issue by carefully adding entries to the robots file for both the upper and lowercase versions of URLs from some 'out of the box' forum software that has duplicate content issues, which I'm successfully controlling with robots.txt.
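A minimal sketch of that approach (the paths here are hypothetical) - because matching is literal, each case variant you want blocked needs its own line:

```
User-agent: *
# each unwanted case variant must be listed explicitly
Disallow: /printthread.php
Disallow: /PrintThread.php
```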

[edited by: Asia_Expat at 4:57 am (utc) on June 4, 2007]

doughayman

12:17 pm on Jun 5, 2007 (gmt 0)

10+ Year Member



Well, my rankings came back yesterday, so it just seems this was a longer-than-normal period. Hopefully, with the hand-crafted ROBOTS.TXT file catching the lowercase entries, I can eradicate this problem over time. What a pain!

doughayman

11:27 am on Jun 14, 2007 (gmt 0)

10+ Year Member



< System: The following message was spliced on to this thread from another location >

Hello all,

I'll try to be as succinct here as possible. I have a Windows-based webserver that, for obvious reasons, cannot discriminate between upper and lowercase URLs (both OS and webserver). For a multitude of reasons I cannot switch to a new technology platform, so please do not recommend this as a solution.

I have had a domain for about 11 years, and have ranked well in the SERPs during this time. The main page of this domain has typically had a PR4, but has recently gone to a PR0.

Google, in accordance with their specs, has always fetched my main page in mixed case, which is how I designed it (I know, probably not the best decision, but this is what I decided on back in 1996).

My main page has the following format:

www.domain.com/HomePage.htm

On occasion (about 4 times per year), I notice that Googlebot fetches the all-lowercase version of this URL:

www.domain.com/homepage.htm

Whenever this happens, my rankings would tank for about a week, only to return to their previous levels. Also, on occasion, I would notice that the mixed-case and all-lowercase versions were BOTH in the Google index. I attributed the week-long tanking to some form of duplication penalty. It would always ultimately get sorted out by Google, and everything would return to normal after about a week.

In an attempt to remedy this quarterly week-long tanking, and through some advice on this website, I enabled a ROBOTS.TXT file that specifically disallowed the all-lowercase entry from being spidered.

Lo and behold, last month Googlebot attempted to spider my all-lowercase URL, and it was "trapped" by ROBOTS.TXT (I verified this through Google Sitemaps tools).

Since then, my mixed-case page has been reduced from a PR 4 to a PR 0, and I have tanked (a la -950 penalty) for most (but not all) 2, 3, and 4-word phrases in the SERPs. I am still in the Google index for this page (I'm not supplemental), but near the very bottom (-950).

The all-lowercase entry is no longer in the Google index, and the mixed-case version stands alone, but I now have a PR 0 for this page, and my rankings have not come back at all after several weeks. A tanking has never lasted this long for me in 11 years. Hence, I'm attributing this latest ROBOTS.TXT action as the catalyst for the recent behavior.

Can anyone shed any light on what may have happened here, and perhaps make some recommendations on what to do next? Has anyone else here had a similar plight?

Thanks in advance!

[edited by: tedster at 12:35 pm (utc) on June 14, 2007]

doughayman

1:10 pm on Jun 14, 2007 (gmt 0)

10+ Year Member



P.S. to last post,

When I use some of the public-domain PageRank checkers available on the internet, the different case variants of my URL show different PageRank values:

- www.domain.com/HomePage.htm has a PR of 4.

- www.domain.com/homepage.htm has a PR of 0.

When I access this page in my IE browser, my server doesn't care about case, but the page that gets rendered shows a PR of 0 (matching the all-lowercase version of my URL - which, BTW, is not currently in the Google index).

g1smd

10:50 pm on Jun 14, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I assume that you host using IIS. This is a known "Duplicate Content" issue. It is a site issue, not a browser issue.

Each page of your site is reachable at a massive number of alternative URLs - one for every possible combination of upper and lowercase characters.

You need to fix this so that only one URL directly returns content, and all the other combinations return either a 404 or a 301 redirect to the canonical URL for that page.
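Since the server in question does support custom CGI scripts, here is one sketch of that per-page 301 - assuming page requests can be routed through a script at all, and using a hypothetical CANONICAL lookup table:

```python
#!/usr/bin/env python3
# Sketch of a case-canonicalizing 301 redirect as a CGI script.
# Assumes the server can route page requests through this script;
# the CANONICAL table and paths are hypothetical examples.
import os
import sys

CANONICAL = {
    "/homepage.htm": "/HomePage.htm",
}

def redirect_target(path):
    """Return the canonical URL if the request used the wrong case, else None."""
    canonical = CANONICAL.get(path.lower())
    if canonical is not None and path != canonical:
        return canonical
    return None

if __name__ == "__main__":
    path = os.environ.get("PATH_INFO", "/")
    target = redirect_target(path)
    if target:
        # Wrong-case request: permanent redirect to the one true URL
        sys.stdout.write("Status: 301 Moved Permanently\r\n")
        sys.stdout.write("Location: %s\r\n\r\n" % target)
    else:
        # Canonical (or unknown) request: serve the real page (elided here)
        sys.stdout.write("Content-Type: text/html\r\n\r\n")
```

Whether the O'Reilly WebSite server can actually dispatch arbitrary URLs to a CGI handler is the open question; if it can only map fixed script paths, this won't help.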

doughayman

12:49 am on Jun 15, 2007 (gmt 0)

10+ Year Member



g1smd,

Actually, I'm running a webserver called O'Reilly "WebSite", which runs on my Windows 2000 platform. It is old technology, no longer supported, but I have a lot of proprietary customized CGI scripts (not directly portable to IIS) that keep me on this platform.

I understand the canonical issues, and I made a big mistake years ago with the mix of upper/lowercase. However, the thought of deleting these pages, waiting 6 months, and then sticking to a single case is too large a risk and hit.

I'm looking for other options as well.

Thanks!

g1smd

1:21 am on Jun 15, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



A 301 redirect to the canonical URL on a per-page basis is about the only thing that can save you.

Some clever scripting to add a meta robots noindex tag when the "wrong" URL is requested could also help, but that will still play havoc with internal linking, spidering, and PageRank.
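If page output could be run through a script at all, the conditional tag described above might be sketched like this (the function and paths are hypothetical; note this is an HTML meta tag emitted per request, not a robots.txt feature):

```python
# Sketch: emit a robots noindex tag only when the requested path's
# case differs from the canonical spelling. Names are hypothetical.
def robots_meta(requested_path, canonical_path):
    """Return a noindex meta tag for wrong-case requests, else an empty string."""
    same_page = requested_path.lower() == canonical_path.lower()
    if same_page and requested_path != canonical_path:
        return '<meta name="robots" content="noindex, follow">'
    return ""

print(robots_meta("/homepage.htm", "/HomePage.htm"))  # the noindex tag
print(robots_meta("/HomePage.htm", "/HomePage.htm"))  # empty string
```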

doughayman

1:37 am on Jun 15, 2007 (gmt 0)

10+ Year Member



Unfortunately, my webserver does not allow for any native 301 redirects via administration/configuration - I've tried that. And yes, I would have had to do it on a per-page basis. Moreover, I don't believe (though I would need to verify) that I could implement this via any sort of lookup table with inbound scripting - I don't think my webserver supports scripting that can intercept an incoming request. I would probably need to obtain and modify the server's source code to make this work. Unfortunately, a non-starter.

As for your other recommendation, I would also need to check, but I'm not aware of a ROBOTS.TXT noindex tag... I used the "Disallow" directive, which did correctly prevent access to the all-lowercase entries, but it seems to have been the catalyst for my current state in Google. I cannot introduce a meta noindex tag in my actual URL files, since Windows doesn't let me discriminate between filenames that differ only in case (they all map to a single file).

Thanks for your concern and suggestions. All good thoughts.