

Server-side "Duplicate Content" Issues

     

tedster

8:56 pm on Oct 22, 2007 (gmt 0)

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member



When people think about SEO, they think about content - on the page, across the site, and in backlink URLs on other domains. But there is a bedrock area that accounts for more trouble than most webmasters usually consider. That bedrock is the server technology itself and the platform used to serve the domain. You don't need to be a sysadmin to have the basic knowledge you need to make certain that your site is not built on quicksand. Google works with you and tries to guard against the common problems, but really - this area is your responsibility.

Here are some of the problems that I see routinely, all of them in the general region of duplicate URL troubles. Really, well over half the domains that I look at have some of these issues. We've discussed some of them in various threads, but I thought there would be value to devoting a thread just to this one topic.

If your previously well-ranked site suddenly begins to develop ranking troubles - check here first!

Basic Rule
If two different URLs both return a 200 OK but serve the same document, then you are getting into duplicate content country and that can create major problems. If any two URLs are not an EXACT character match then they are different.

1. Dynamic URLs
What happens if the order of two parameters is reversed? Only one order should result in a 200 status.
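One way to enforce a single order is a mod_rewrite rule that 301s the reversed form to the canonical one. This is only a sketch for a hypothetical product.php taking cat and id parameters - adapt the pattern to your own URLs:

```apache
# Hypothetical: canonical order is ?cat=N&id=N.
# If the parameters arrive reversed, 301 to the canonical order.
RewriteEngine On
RewriteCond %{QUERY_STRING} ^id=([0-9]+)&cat=([0-9]+)$
RewriteRule ^product\.php$ /product.php?cat=%2&id=%1 [R=301,L]
```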

2. Rewrite Schemes
Did you take the lazy man's route and key off a number in the URL - and then just throw a keyword into the filepath so that you have it in the URL? What happens if the number is correct but the keyword is a typo, or even total garbage? With any rewrite scheme, and especially on a site of some complexity, test your set-up with lots of creatively "bad" URLs - really kick that server around.
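For illustration, here is the kind of "lazy" rule being described - a sketch assuming a hypothetical article.php. Only the numeric ID is captured, so any garbage after the hyphen still returns 200:

```apache
# Lazy rewrite: /article/123-any-text-at-all maps to article.php?id=123.
# The keyword part is never validated, so misspelled or garbage slugs
# all return 200 OK with identical content - a duplicate URL for every typo.
RewriteRule ^article/([0-9]+)-.*$ /article.php?id=$1 [L]
```

The fix has to happen in the application: look up the canonical slug for that ID, and 301 (or 404) when the requested path doesn't match it.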

3. "Custom" 404
If the header for the error page you serve is not 404 (or 410 Gone) then it doesn't matter what the title or body content of that page says. Common problems in this area come from using a 302 redirect when the URL is not found. A 302 redirect on the same domain will usually result in the requested URL being indexed, but with the content of your "custom error page." Over time, every bad URL ever requested can be indexed as a duplicate URL for that one page. Eventually, your entire domain looks like garbage - and I do mean eventually. This kind of error can be a timebomb.
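In Apache, this exact trap is easy to fall into with ErrorDocument: give it a full URL and Apache issues a redirect instead of serving the page with a 404 status. A local path keeps the status intact:

```apache
# Wrong: a full URL makes Apache send a redirect to the error page,
# so the client never sees the original 404 status.
# ErrorDocument 404 http://www.example.com/notfound.html

# Right: a local path is served in place and the 404 status is preserved.
ErrorDocument 404 /notfound.html
```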

4. Double Slashes
Apache has a native configuration that ignores double slashes in the file path and treats them as a single slash. It's best to address this with a rewrite rule.
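A common recipe for this, as a sketch for a typical .htaccess setup: test the raw request line, because Apache has already merged the slashes by the time RewriteRule sees the path, then 301 to the collapsed form:

```apache
# The raw request line still contains the doubled slashes even though
# the path RewriteRule matches against has been merged, so test THE_REQUEST.
RewriteEngine On
RewriteCond %{THE_REQUEST} ^[A-Z]+\s[^?\s]*//
RewriteRule ^(.*)$ /$1 [R=301,L]
```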

5. Two Levels of Error Handling
It's common for the server itself to have one native level of error handling, while the platform that serves a URL has a second level. One example would be IIS itself handling some basic errors while .NET handles errors in the query string. I've seen similar issues with PHP/mySQL and also .jsp/Tomcat websites. I'm pretty sure it can happen on ColdFusion, too. So make sure that BOTH kinds of not-found errors are returning a true 404 http status.

Anyone have more?

tedster

12:36 am on Oct 23, 2007 (gmt 0)




I knew I forgot some common issues: Domain Root vs. index.html - another kind of duplicate [webmasterworld.com]

...and of course, the classic "canonical" troubles: Why "www" & "no-www" Are Different [webmasterworld.com]
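The usual mod_rewrite fix for the www/no-www split, as a sketch (example.com is a placeholder; swap the hostnames if no-www is your canonical form):

```apache
RewriteEngine On
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
```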

< These and other duplicate issues are discussed in threads
in our Hot Topics [webmasterworld.com] thread, which is always pinned to the top
of this forum's index page. >

Robert Charlton

3:12 am on Oct 23, 2007 (gmt 0)

WebmasterWorld Administrator robert_charlton is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



There also can be http vs https issues.

And, in an extensionless environment, where page filenames carry no .html, .asp, etc. file extensions, there can be confusion between pages and directories if you set up your pages with trailing slashes. Use trailing slashes for directory URLs... no slashes when an extensionless page file is referenced.

nervo

3:31 pm on Oct 23, 2007 (gmt 0)

10+ Year Member



So, what would be better for internal linking (and why)

http://www.example.com/dir/ or http://www.example.com/dir/index.html

Thanks!

[edited by: tedster at 3:56 pm (utc) on Oct. 23, 2007]
[edit reason] switch to example.com - it can never be sold [/edit]

tedster

4:00 pm on Oct 23, 2007 (gmt 0)




For a new site, I would use the first url ending with a slash, and make sure that the index.html version is 301 redirected.

For an established site, I would study what Google already has indexed. If the index.html version is already indexed, then use those versions of the url and 301 redirect in the other direction.

Consistency is the most important goal, here. Do it the same way across the entire site.
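For the new-site case, a mod_rewrite sketch. Testing THE_REQUEST (the raw request line) rather than the rewritten path avoids a loop when DirectoryIndex internally serves index.html:

```apache
RewriteEngine On
# Only redirect when the client actually asked for .../index.html;
# the internal DirectoryIndex subrequest won't match THE_REQUEST, so no loop.
RewriteCond %{THE_REQUEST} ^[A-Z]+\s(.*/)?index\.html[?\s]
RewriteRule ^ %1 [R=301,L]
```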

nervo

5:12 pm on Oct 23, 2007 (gmt 0)




Tedster, thanks for your explanation!
However, I'm afraid I made a mess with a 6-month-old site where Google indexed the homepage as www.example.com/ but, while the majority of the site's directories are indexed at the directory root, others are indexed as www.example.com/dir/index.html.

Does that require a different approach?

tedster

5:22 pm on Oct 23, 2007 (gmt 0)




With mixed results already indexed, I would usually choose the directory style url without the file name. There may be a short period of adjustment as Google sorts out the changes, but then going forward everything would be optimal.

pageoneresults

5:25 pm on Oct 23, 2007 (gmt 0)

WebmasterWorld Senior Member pageoneresults is a WebmasterWorld Top Contributor of All Time 10+ Year Member



There also can be http vs https issues.

I'd like to emphasize this particular issue as those of us on Windows seem to be more susceptible to this. I've seen Google index both http and https versions of a site that was improperly configured. If you do a site: search for your domain and that first listing is https: you have a potential problem.
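Where the whole site is meant to be served over plain http, one fix (an Apache sketch with example.com as a placeholder; on IIS the equivalent is done in the site configuration or a rewrite filter) is to 301 any https request back to http:

```apache
RewriteEngine On
RewriteCond %{HTTPS} on
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
```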

nervo

5:44 pm on Oct 23, 2007 (gmt 0)




Thanks again!

One other thing: Is there a difference for both internal and external linking in using these:

www.example.com
www.example.com/

Tonearm

7:14 pm on Oct 24, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



There also can be http vs https issues.

I had no idea this was a duplicate content issue. So any page should only be able to be accessed via http or https?

jd01

7:22 pm on Oct 24, 2007 (gmt 0)




www.example.com
www.example.com/

Without the trailing / you force your server to add it on... IMO it's better to add it yourself in the link to conserve CPU cycles.

Justin

Shurik

9:29 pm on Oct 24, 2007 (gmt 0)




On IIS there is absolutely no way to prevent access to directories as files:
www.sample.com/directory and
www.sample.com/directory/ are treated the same.

IIS silently converts the first into the second - very annoying! Robots.txt is of no use either.

 
