However, for some of my most important search terms (which I am on the first page for), several of my competitors rank higher than I do while using duplicate content. In several situations two of them are higher than my page, and they are using the SAME ~700 words of content, word for word.
How can this be if original content is all-important to search engines? It is frustrating to see my original content indexed lower than copied content. By the way, my PageRank is 3, theirs is 4. Is this the reason?
Incidentally I'm in the same boat. Watching other people soar to the top with little effort when it's taken me 18 months. Truly sickening.
First there is the situation where a page on one domain has the same content as a page on another domain. In this case, Google has a complex algorithm that tries to filter the results so that they are not swamped with copies of the same information in the user's top results. In other words, the Google algorithm will try to pick one domain and filter the others to a lower position (or cluster them in a "similar pages" listing) -- but this is not a penalty inflicted on a site, it's a filter on the results. It might feel like a penalty if your domain is the origin of the content and your page gets filtered out of the first page results, but there's no "black mark" against you in this case.
And yes, the filters are not perfect at all, and sometimes copies of the same content can dominate a result.
A second case occurs when, through some poorly thought-out code, many differing URLs on the SAME domain all resolve to the same content. There are all kinds of errors in planning and configuring a server that can make this happen. A common error, for instance, is putting tracking variables into a query string. A different query string is a different URL! Another mistake is using a redirect to serve a "custom error page" instead of returning a 404 HTTP status.
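Just to make the query-string point concrete, here is a tiny illustration (the tracking parameter names are made up for the example, not taken from anyone's site). To a crawler, each of these is a separate URL, even though the server would hand back the identical page for all three:

from urllib.parse import urlsplit

urls = [
    "http://www.example.com/article.html",
    "http://www.example.com/article.html?ref=newsletter",
    "http://www.example.com/article.html?utm_source=partner&utm_campaign=spring",
]

# A crawler keys its index on the full URL, query string included.
for u in urls:
    print(urlsplit(u).path, "|", urlsplit(u).query)

print(len(set(urls)), "distinct URLs, one piece of content")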
Google tries to select the dupes and then put all but one of them into the "supplemental index". If a domain has just a few instances of duplication like this in the Google index, things tend to go on as normal. But when many, many URLs start showing up, all with identical content, then something seems to get tripped at Google and a site can start to see trouble.
I think a lot of that trouble is just a protection for googlebot - to keep it from spending all kinds of bandwidth grabbing one copy after another of the same thing. But whatever the exact programming reason, URLs that used to bring in traffic can start to drop from the search results.
Case #1 - duplicate content is on different domains
Case #2 - duplicate content is all on the same domain
How to deal with this? In case #1, keep building those "signals of quality" in order to compete against the other domains. Or maybe file a few DMCA complaints! In case #2, locate the problem and fix it -- ensure that any bit of content can only be accessed by one unique URL, and that every unique URL only ever gets one consistent bit of content.
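For case #2, here is a rough sketch of the "one bit of content, one URL" idea -- the tracking parameter names are assumptions for the example, and it is written as a generic Python WSGI wrapper rather than anything specific to your setup. Any request that arrives carrying tracking noise gets a 301 back to the clean URL, so only one address ever serves the content:

from urllib.parse import parse_qsl, urlencode

# Assumed parameter names -- substitute whatever your own tracking links use.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "ref", "sessionid"}

class CanonicalURLMiddleware:
    """Wrap a WSGI app and 301-redirect URLs that carry tracking parameters."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        pairs = parse_qsl(environ.get("QUERY_STRING", ""), keep_blank_values=True)
        kept = [(k, v) for k, v in pairs if k not in TRACKING_PARAMS]
        if len(kept) != len(pairs):
            # Rebuild the URL without the tracking noise and redirect permanently,
            # so everything consolidates onto the one clean URL.
            path = environ.get("PATH_INFO", "/") or "/"
            location = path + ("?" + urlencode(kept) if kept else "")
            start_response("301 Moved Permanently", [("Location", location)])
            return [b""]
        return self.app(environ, start_response)

The same principle covers any other source of variant URLs: pick one canonical form and 301 everything else to it.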
There are other varieties of duplicate content that can get more complex -- for example, where the duplication is across two or more domains that have a "Hilltop" relationship (the next-to-last domain token is the same, a shared IP address, the domains are heavily interlinked, etc.). This type of duplicate content can cause trouble. Naively configured "domain forwarding" that does not use a 301 redirect, for example, can create this kind of problem, especially if many domains are involved.
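If you want to see how a forwarded domain is actually answering, a request that does not follow redirects shows straight away whether you are getting a proper 301 with a Location header or a duplicate 200 copy of the page (example.org below is just a placeholder for the forwarded domain):

import http.client

# http.client does not follow redirects, so the raw answer is visible.
conn = http.client.HTTPConnection("example.org")   # placeholder: the forwarded domain
conn.request("GET", "/")
resp = conn.getresponse()
print(resp.status, resp.reason, resp.getheader("Location"))   # want: 301 plus the main domain
conn.close()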
For examples of duplicate content handling at Google, I found it can help to search for widely re-published news stories and study how those results are handled.
--------
In the case that steveweber123 mentions in the opening post -- yes, PageRank of the URL can be one factor here. Others might be anchor text within the site or on inbound links, the kind of template that holds the content, the titles of linking pages, and so on.
It looks to me like you can beat a higher PR page if you have more natural inbound links that point DIRECTLY to the page. The competition may still have higher PR because they have lots more links elsewhere and they circulate that PR well, but I think I see evidence that Google will value deep and direct inbound links.
If Google puts the real page in the supplemental index, it is in limbo land. The other copy isn't likely to rank either, for lack of links and other such signals.
When it comes to software written by wetware to control hardware, you should expect some errors somewhere.
In "Case #2" external links may point to different URLs which result in the same page of information displayed. So page rank from the links is split among the different URLs instead of all applying to one page. This is the same problem caused by having multiple domains resolve to the same site without a 301 redirect (for example having www.domain.com and domain.com going to the same site without redirecting one to the other).
-- Roger
As for the "Error Page", it MUST return a HTTP response of "404". Nothing else will do. That will signal to the bot that there is nothing at that URL to be indexed.
Far too often I see a site where if you ask for any page that does not exist (either a page that has gone, or just a simple typo in the URL), you are served a 302 redirect which takes you to a page with some basic site navigation on it.
The problem is, is that that URL will be indexed with the Error Page content, and so will all of the other URLs (potentially an infinite number) that lead to that same Error Page content.
So, once your error page content is indexed under thousands of different URLs, you have a big Duplicate Content problem on your hands; especially if you have, say, 100 real pages of content, and 500 "fake" Error Page URLs indexed. Those mis-indexed URLs are going to trip some trigger to cause you a lot of grief.
Make sure your Error Page really does return "404 Not Found" in the HTTP header. Use WebBug (or a quick script like the one below) to check that it really really really does this.
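If you don't have WebBug handy, a couple of lines of Python do the same check -- request a path you know doesn't exist and look at the status the server actually returns (the host and path below are placeholders):

import http.client

conn = http.client.HTTPConnection("www.example.com")      # placeholder host
conn.request("GET", "/this-page-does-not-exist-12345")     # deliberately bogus path
resp = conn.getresponse()
print(resp.status, resp.reason)   # want: 404 Not Found, not 200 or 302
conn.close()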