The thread "A guide to fixing duplicate content & URL issues on Apache" [webmasterworld.com] (How to canonicalize all of your URLs with a single redirect) has once again raised the question of whether one should correct UPPERCASE letters in requested URIs or simply keep returning a 404 Not Found. That discussion shows how to do so if one chooses; this discussion is more along the lines of why one may or may not choose to do so.
First, here is an example of the issue for clarification. Let's say a developer creates everything in lowercase letters. Directories, files and even file extensions are all lowercase so that every URI on the system will be lowercase. Therefore the following URIs ...
http://www.example.com/path/to/mypage.htm
http://www.example.com/path/to/MyPage.htm
... return a 200 OK for the first URI and a 404 Not Found for the second.
How would this incorrect URI come about in the first place? Most often we are looking at either type-in traffic or a link that was not keyed correctly by the inbound link developer. I can't call it a misspelled link because that is not the case. The spelling is correct, the CaPiTaLiZaTiOn is not.
Personally, I've always been the type of internet user that if I typed in a URI or clicked a URI that led to a 404 I would figure out how to get to the resource one way or another (if it still existed), working my way in from the site's root or using the site search feature if necessary. But not everybody is like that. An elderly person not as familiar with the concept may just abandon the attempt altogether. The intended audience must be identified and catered to.
What factors determine whether or not you will adjust this issue for any given domain?
The intended audience must be identified and catered to.
And I think it's important here to distinguish between bots and humans.
As you rightly say, the internet-savvy (or experienced) user will probably find the page that they intended to land on, either by guessing the mistake or doing a little digging. A bot/spider, like its less experienced internet user counterpart, does not have that degree of "URL intelligence". From a webpage marketing perspective, you don't really want to lose the link-love from example.com linking to you just because they typo'd the URL.
So rewriting and correcting URLs, where possible, can be good for a variety of reasons. The audience may just as much be a search engine as a human.
What factors determine whether or not you will adjust this issue for any given domain?
I have never experienced a major problem with this particular issue, but on similar issues I have always looked at balancing server load to reward (reward here being either rankings or positive human user experience).
First, it's important to remember that your logfiles are like gold dust. The answer to the question "what should I do" is probably in there.
If there's one single page that is trying to link to me appearing in my log files under the 404 section, and the mistake is really obvious, like a case problem, then I would probably return a 301 using the .htaccess file in its parent folder, and send out the page that I know for a fact they wanted. I certainly wouldn't start building a complex set of rules just for the sake of that.
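To make the log-mining step concrete, here is a minimal Python sketch that looks for 404s that are pure case mistakes. It assumes the access log uses Apache's common or combined log format and that every real URL-path on the site is lowercase and known in advance. The names `KNOWN_PATHS` and `case_only_404s`, and the sample log lines, are made up for illustration.

```python
import re
from collections import Counter

# Assumption: the site's real URL-paths, all lowercase (hypothetical list).
KNOWN_PATHS = {"/path/to/mypage.htm", "/index.htm"}

# Matches the request line and status code of a common/combined log entry.
LOG_LINE = re.compile(r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" (?P<status>\d{3}) ')

def case_only_404s(lines, known_paths=KNOWN_PATHS):
    """Count 404 requests whose path differs from a known path only by case."""
    hits = Counter()
    for line in lines:
        m = LOG_LINE.search(line)
        if not m or m.group("status") != "404":
            continue
        path = m.group("path").split("?", 1)[0]  # ignore any query string
        if path not in known_paths and path.lower() in known_paths:
            hits[path] += 1
    return hits

# Made-up sample entries, for illustration only.
sample = [
    '203.0.113.5 - - [10/Oct/2008:13:55:36 +0000] "GET /path/to/MyPage.htm HTTP/1.1" 404 209 "http://www.example.org/links.htm" "Mozilla/5.0"',
    '203.0.113.9 - - [10/Oct/2008:13:56:01 +0000] "GET /path/to/mypage.htm HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
    '203.0.113.7 - - [10/Oct/2008:13:57:12 +0000] "GET /no/such/page.htm HTTP/1.1" 404 209 "-" "Mozilla/5.0"',
]
print(case_only_404s(sample))
```

If this turns up only one or two paths, a hand-written 301 for each is probably all that is needed.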
If there's a whole bunch of typo/incorrect case URLs then it's a more complex problem. If the original linking party is easily contactable, then personally I would always rather fix the mistake than try and work around it. Remember that every line in your .htaccess file costs memory and CPU time in every single thread that httpd spawns.
If it's not possible to contact the party causing the errors, and you really want to serve the humans and/or bots coming in on those links (i.e. the reward is high), you may not have a lot of choice but to build a set of rules into your webserver to handle it.
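The decision such a rule set has to make can be modelled in a few lines. The following is only a Python sketch of the logic, not Apache configuration. It assumes every real path on the site is lowercase, and the function name `resolve` is hypothetical.

```python
def resolve(path, known_paths):
    """Return (status, location) for a requested URL-path.

    Assumption: every real path on the site is lowercase.
    """
    if path in known_paths:
        return 200, None
    lowered = path.lower()
    if lowered in known_paths:
        # Only the capitalization was wrong: redirect permanently.
        return 301, lowered
    # Nothing matches even after lowercasing: a genuine 404.
    return 404, None

known = {"/path/to/mypage.htm"}
print(resolve("/path/to/mypage.htm", known))
print(resolve("/path/to/MyPage.htm", known))
print(resolve("/path/to/other.htm", known))
```

On Apache itself, the lowercasing step is usually done with mod_rewrite's internal `tolower` map function. Note that `RewriteMap` has to be declared in the server or virtual host configuration, not in .htaccess.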
TJ
Postscript: thinking about what I've written above, I would go on to state that I can see no valid reason whatsoever for serving a 404 error header to a bot. If you're serving 404s to bots, you have a problem that needs to be addressed in some way or another, and that same problem is just as likely to be causing issues for less tech-savvy human visitors.
If you have incoming links that point to incorrect URLs, then ultimately get the other site to correct their errors, but in the meantime cater for it using a 301 redirect.
How would this incorrect URI come about in the first place?
Personally, I've always been the type of internet user that if I typed in a URI or clicked a URI that led to a 404 I would figure out how to get to the resource one way or another
Personally I don't automatically fix the case, because that takes away the opportunity for the linking site to discover the broken link and fix it themselves. I also have a gut feeling that this strategy is suboptimal for Google.
Instead my strategy is to
1) use a custom 404 page that gives people a painless way to find what they had been looking for, and
2) monitor my 404s and periodically check the referring pages. If the page sends a reasonable amount of traffic and/or looks like a good backlink, I manually set up a redirect.
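The monitoring in step 2 could be sketched roughly as follows in Python. This assumes the access log is in Apache's combined format (which records the Referer header). The function name `referrers_of_404s`, the `min_hits` threshold, and the sample log lines are all made up for illustration.

```python
import re
from collections import Counter

# Pulls the status and Referer fields out of a combined-format log line.
LOG_LINE = re.compile(
    r'"[A-Z]+ (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ "(?P<referer>[^"]*)"'
)

def referrers_of_404s(lines, min_hits=2):
    """Return (referer, hits) pairs for 404 responses, most frequent first."""
    counts = Counter()
    for line in lines:
        m = LOG_LINE.search(line)
        if not m or m.group("status") != "404":
            continue
        referer = m.group("referer")
        if referer not in ("", "-"):  # skip type-in traffic with no referrer
            counts[referer] += 1
    return [(ref, n) for ref, n in counts.most_common() if n >= min_hits]

# Made-up sample entries, for illustration only.
sample_log = [
    '203.0.113.5 - - [10/Oct/2008:13:55:36 +0000] "GET /path/to/MyPage.htm HTTP/1.1" 404 209 "http://www.example.org/links.htm" "Mozilla/5.0"',
    '203.0.113.6 - - [10/Oct/2008:14:02:10 +0000] "GET /path/to/MyPage.htm HTTP/1.1" 404 209 "http://www.example.org/links.htm" "Mozilla/5.0"',
    '203.0.113.7 - - [10/Oct/2008:14:05:44 +0000] "GET /old.htm HTTP/1.1" 404 209 "http://www.example.net/blog.htm" "Mozilla/5.0"',
    '203.0.113.8 - - [10/Oct/2008:14:06:02 +0000] "GET /path/to/mypage.htm HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
]
print(referrers_of_404s(sample_log))
```

The output is only a shortlist. Whether a referring page "looks like a good backlink" still has to be judged by visiting it.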