Forum Moderators: Robert Charlton & goodroi
Here are some of the problems that I see routinely, all of them in the general region of duplicate URL troubles. Really, well over half the domains that I look at have some of these issues. We've discussed some of them in various threads, but I thought there would be value to devoting a thread just to this one topic.
If your previously well-ranked site suddenly begins to develop ranking troubles - check here first!
Basic Rule
If two different URLs both return a 200 OK but serve the same document, then you are getting into duplicate content country and that can create major problems. If any two URLs are not an EXACT character match then they are different.
1. Dynamic URLs
What happens if the order of two parameters is reversed? Only one order should result in a 200 status.
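One way to enforce a single parameter order is to compute a canonical form of every requested URL and answer 200 only when the request already matches it. The following is a minimal sketch, not a definitive implementation: the alphabetical ordering, the function names `canonical_query_url` and `respond_to_query_url`, and the example.com URLs are all assumptions made for illustration, and a 301 to the preferred order is just one acceptable non-200 answer.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def canonical_query_url(url):
    """Return the URL with its query parameters sorted by name, then value.

    Assumption: alphabetical order is the site's preferred order.
    Note that urlencode() also normalises escaping (a space becomes '+').
    """
    parts = urlsplit(url)
    params = sorted(parse_qsl(parts.query, keep_blank_values=True))
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(params), parts.fragment)
    )


def respond_to_query_url(url):
    """Return (status, redirect_target) for a requested URL."""
    canonical = canonical_query_url(url)
    if url == canonical:
        return (200, None)       # exact character match: serve the page
    return (301, canonical)      # any other order: redirect to the one true URL


print(respond_to_query_url("http://example.com/item?cat=3&id=7"))
print(respond_to_query_url("http://example.com/item?id=7&cat=3"))
```

With this approach the reversed order never returns a 200, so only one version of the URL can be indexed.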
2. Rewrite Schemes
Did you take the lazy man's route and key off a number in the URL - and then just throw a keyword into the filepath so that you have it in the URL? What happens if the number is correct but the keyword is a typo, or even total garbage? With any rewrite scheme, and especially on a site of some complexity, test your set-up with lots of creatively "bad" URLs - really kick that server around.
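To make the number-plus-keyword problem concrete, here is a rough sketch of a resolver that looks the page up by number but still insists that the whole path be an exact match. The `/articles/<id>/<slug>` layout, the `ARTICLES` table and the function name `resolve_article` are hypothetical, invented only for this example.

```python
import re

# Hypothetical catalogue: numeric ID -> the one correct keyword for that page.
ARTICLES = {42: "blue-widgets", 43: "red-widgets"}

# Assumed URL layout: /articles/<number>/<keyword>
ARTICLE_PATH = re.compile(r"^/articles/(\d+)/([^/]*)$")


def resolve_article(path):
    """Return (status, redirect_target) for a requested path."""
    match = ARTICLE_PATH.match(path)
    if match is None:
        return (404, None)                 # not an article URL at all
    article_id = int(match.group(1))
    slug = ARTICLES.get(article_id)
    if slug is None:
        return (404, None)                 # unknown number: true 404
    canonical = f"/articles/{article_id}/{slug}"
    if path == canonical:
        return (200, None)                 # exact character match
    return (301, canonical)                # right number, wrong keyword


print(resolve_article("/articles/42/blue-widgets"))
print(resolve_article("/articles/42/total-garbage"))
print(resolve_article("/articles/99/blue-widgets"))
```

Whether a wrong keyword gets a 301 or a 404 is a judgment call; the point is that it never gets a 200.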
3. "Custom" 404
If the header for the error page you serve is not 404 (or 410 Gone) then it doesn't matter what the title or body content of that page says. Common problems in this area come from using a 302 redirect when the URL is not found. A 302 redirect on the same domain will usually result in the requested URL being indexed, but with the content of your "custom error page." Over time, every bad URL ever requested can be indexed as a duplicate URL for that one page. Eventually, your entire domain looks like garbage - and I do mean eventually. This kind of error can be a time bomb.
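The fix is to serve the friendly error page directly at the requested URL with a 404 status, rather than redirecting to it. Below is a small sketch using Python's standard WSGI calling convention; the `PAGES` table, the `ERROR_BODY` text and the app name `site_app` are placeholders assumed for the example, not anything from a real site.

```python
# Hypothetical site content: path -> HTML body.
PAGES = {"/": b"<h1>Home</h1>"}

# The "custom error page" body shown to visitors.
ERROR_BODY = b"<h1>Sorry, that page does not exist.</h1>"


def site_app(environ, start_response):
    """Minimal WSGI app: known paths get 200, everything else a true 404."""
    path = environ.get("PATH_INFO", "/")
    body = PAGES.get(path)
    if body is None:
        # No redirect: the custom page is sent with the 404 status itself.
        status, body = "404 Not Found", ERROR_BODY
    else:
        status = "200 OK"
    start_response(
        status,
        [("Content-Type", "text/html"), ("Content-Length", str(len(body)))],
    )
    return [body]


if __name__ == "__main__":
    from wsgiref.simple_server import make_server
    # Port 8000 is an arbitrary choice for local testing.
    make_server("localhost", 8000, site_app).serve_forever()
```

Requesting any unknown path from this app and looking at the response headers should show the 404 status line alongside the friendly body, which is the combination you want.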
4. Double Slashes
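By default, Apache ignores double slashes in the file path and treats them as a single slash, so the double-slash URL and the single-slash URL both return 200 OK with the same content. It's best to address this with a rewrite rule that 301 redirects to the single-slash version.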
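The logic such a rule needs is simple enough to sketch in a few lines. This is an illustration of the idea only, not Apache configuration; the function names `collapse_slashes` and `slash_redirect` are made up for the example, and it is assumed that only the path (not the query string) is being checked.

```python
import re


def collapse_slashes(path):
    """Replace every run of two or more slashes in the path with one slash."""
    return re.sub(r"/{2,}", "/", path)


def slash_redirect(path):
    """Return (status, redirect_target) for a requested path."""
    canonical = collapse_slashes(path)
    if path == canonical:
        return (200, None)     # already clean: serve it
    return (301, canonical)    # extra slashes: redirect to the clean path


print(slash_redirect("/widgets/blue.html"))
print(slash_redirect("/widgets//blue.html"))
```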
5. Two Levels of Error Handling
It's common for the server itself to have one native level of error handling, but the platform that serves a URL can have a second level. One example: IIS itself handles some basic errors, but .NET handles errors in the query string. I've seen similar issues with PHP/MySQL and also .jsp/Tomcat websites. I'm pretty sure it can happen on ColdFusion, too. So make sure that BOTH kinds of not-found errors are returning a true 404 HTTP status.
Anyone have more?
...and of course, the classic "canonical" troubles: Why "www" & "no-www" Are Different [webmasterworld.com]
< These and other duplicate issues are discussed in threads
in our Hot Topics [webmasterworld.com] thread, which is always pinned to the top
of this forum's index page. >
And, in an extensionless environment, where there are no .html, .asp, etc. file extensions on page filenames, there can be confusion between pages and directories if you set up your pages with trailing slashes. Use trailing slashes for directory URLs... no slashes when an extensionless page file is referenced.
For an established site, I would study what Google already has indexed. If the index.html version is already indexed, then use those versions of the url and 301 redirect in the other direction.
Consistency is the most important goal, here. Do it the same way across the entire site.
Does that require a different approach?
There also can be http vs https issues.
I'd like to emphasize this particular issue, as those of us on Windows seem to be more susceptible to it. I've seen Google index both http and https versions of a site that was improperly configured. If you do a site: search for your domain and the first listing is https: you have a potential problem.
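The http/https and www/no-www variations can be handled with the same test: pick one scheme and one hostname, and redirect everything else to it. Here is a hedged sketch; `CANONICAL_SCHEME`, `CANONICAL_HOST` and `origin_redirect` are names invented for the example, and the values chosen are assumptions (use whichever version of your site is already indexed).

```python
from urllib.parse import urlsplit, urlunsplit

# Assumptions for illustration: substitute the scheme and host you settled on.
CANONICAL_SCHEME = "http"
CANONICAL_HOST = "www.example.com"


def origin_redirect(url):
    """Return the URL to 301 to, or None if scheme and host are already canonical."""
    parts = urlsplit(url)
    if parts.scheme == CANONICAL_SCHEME and parts.netloc.lower() == CANONICAL_HOST:
        return None
    # Keep path and query as requested; swap in the canonical scheme and host.
    return urlunsplit(
        (CANONICAL_SCHEME, CANONICAL_HOST, parts.path or "/", parts.query, "")
    )


print(origin_redirect("https://example.com/page?x=1"))
print(origin_redirect("http://www.example.com/page"))
```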