
Google News Archive Forum

Google and duplicate Content
A few comments about duplicate content and google

 5:55 am on Aug 2, 2002 (gmt 0)

I've been following the discussion about Google and mirrored information for some time. It is "common knowledge" that Google penalizes page rank when it determines that content is duplicated somewhere else. In fact, I've read many experts stating that there should be no duplicate domain names and no duplicate content anywhere.

On the face of it the arguments appear to be sound. Google obviously has several billion pages in its database and could, it appears, easily determine if content is duplicated. It also seems, again on the face of it, that it's reasonable to check for duplicate content, as this is the "mark of a spammer" and not necessary on the web with hyperlinking available. At least, this is the common wisdom.

However, sometimes what seems reasonable and possible is not: not by a long shot.

Let's begin with the technical side of things. You've got domain x and domain y with exactly the same content. How on earth would Google be able to figure that out? Let's say Google had 3 billion pages in its database. To compare every page to every page would be an enormous task: with 3 billion pages, that is on the order of quintillions of pairwise comparisons.
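
For a sense of scale, a naive all-against-all check over n pages needs n(n-1)/2 comparisons. Here is a quick back-of-the-envelope sketch, taking the 3 billion figure above as the assumed page count (the function name is just for illustration):

```python
def pairwise_comparisons(n_pages: int) -> int:
    """Number of distinct page pairs in a naive all-against-all comparison."""
    return n_pages * (n_pages - 1) // 2


if __name__ == "__main__":
    n = 3_000_000_000  # assumed index size, taken from the figure quoted above
    print(f"{pairwise_comparisons(n):.2e} comparisons")
```

For 3 billion pages this comes to roughly 4.5 x 10^18 pairs, which is why a brute-force compare looks hopeless.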

Now, if site x had page "page1" which linked to site y which also had "page1", then it would be possible for Google to determine the duplicate content. Conceivably, it could check this out.

Not only is the task enormous, but the benefit is so tiny as to be insignificant. Duplicate content does not imply, in any way, shape or form, spamming. In actual fact, a duplicate site is generally going to lower page rank of BOTH sites. Instead of having 100 links to one site, there will presumably be 50 links to one and 50 to another. This would tend (all things being equal) to lower the page ranking of both sites. So Google gains nothing by this incredible expenditure of resources.

There are several reasons for duplicate content which have nothing to do with spamming. Sometimes the content is actually duplicated, and sometimes it's just that there are several different domains (at least the www and non-www versions) for the same website.

Mirroring a site for load balancing - This is very common. The purpose is to split up the traffic between two copies of the site.

Mirroring for region - Sometimes site mirroring is done simply to make it more efficient on the internet backbone itself. You might put an identical copy of a site in Europe, for example, to reduce traffic across the Atlantic, which should make it faster in European countries.

Viral marketing - It's extremely common to allow other sites to republish articles in return for a link.

Different domain names - Sometimes a site might be referenced on many different domain names. You might want to allow the .com, .net and .org versions of the name to all work the same, you might allow for common misspellings or you might cover different keywords (sewing-tips and sewing-secrets are examples of possible combinations).

Different domain names for different markets - you might also want to reference your site by different names in order to target different markets. You could, for example, have a site about search engine optimization and want to target both SEO and web designers. Thus domain names like seo.com and webdesign.com would make sense.

www - Any good webmaster knows his or her site needs to be referenced with and without the www.

Okay, so what's the smart thing to do? Well, it is possible that search engines do compare a limited number of pages to check for duplication. They could certainly check if someone reported something, and they might check directly linked pages (although this is still a heck of a lot of overhead for very little benefit).

Of course, Google and the other search engines can account for a hefty percentage of the traffic received by a site. In fact, sometimes the number can exceed 70 percent. So it's wise to spend some time ensuring that you are totally clean when it comes to search engine optimization. In other words, a technician from any search engine should be able to examine your site down to its smallest detail and find no evidence of any kind of search engine spamming (attempting to get higher rankings by unethical means). This is absolutely critical to a site's survival for the long term.

Keeping that in mind, here's what I tend to do.

Multiple domains - Using multiple domains to the same site has a tremendous number of advantages. Thus, I tend to follow the advice given by others: take advantage of permanent redirection. In other words, set up a redirection (a 301 status code) which simply tells the browser "this page has moved, proceed to this page, and the move is permanent." This tells the spider about the redirection with no possibility of misunderstanding, yet allows for the multiple domains.
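
To make the 301 concrete, here is a minimal sketch of what such a response looks like on the wire: a status line plus a Location header naming the new address. The function name and the example URL are hypothetical, and a real server would normally generate this for you.

```python
def permanent_redirect(location: str) -> bytes:
    """Build a minimal HTTP/1.1 301 response pointing at `location`."""
    lines = [
        "HTTP/1.1 301 Moved Permanently",  # status line: the move is permanent
        f"Location: {location}",           # where the client should go instead
        "Content-Length: 0",               # no body is needed
        "Connection: close",
        "",                                # blank line ends the headers
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


if __name__ == "__main__":
    # Hypothetical target URL, for illustration only.
    print(permanent_redirect("http://www.example.com/page.html").decode("ascii"))
```

A browser or spider that receives this follows the Location header and, because the status is 301 rather than a temporary redirect, can treat the new address as the one to remember.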

Republished articles - I allow others to republish many of my articles, and at this time I have records of over 10,000 of them all over the internet on thousands of web sites. This is not a problem, as these articles are sent in text format. The webmaster must then drop this text into his site, which requires some reformatting and shuffling around. Thus, the finished articles may have the same text but the formatting is very, very different. This is a highly respected method of gaining a large number of incoming links: I give you something (an article, i.e., content) and you give me something (a link back to my site).

Mirroring - I haven't needed to do this yet, so I have no advice as to what to do if a site requires actual, physical multiple versions of itself. I would tend to just do it overtly (out in the open) and not worry about it.



 7:00 am on Aug 2, 2002 (gmt 0)


Commenting on just one point...

The job of finding exact page duplicates is actually rather easy, and not as big a task as might be imagined.

As each page is indexed, compute a checksum, cyclic redundancy code, or "hash" of the page, and save
this along with the URL. After declaring the current spidering run to be finished, sort the list of
checksums according to numeric value. Compare those pages having the exact same checksum using a
character-by-character compare, and quit as soon as you find a mismatch. If the set of files which
must be compared in this manner is found to be too large, simply change the method above to record
two or three different "hashes," using checksum, CRC and byte count, and concatenate them.
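
The procedure above can be sketched roughly as follows. This is only an illustration, not how any search engine actually implements it: the `pages` dictionary is a hypothetical in-memory stand-in for a crawl, and CRC32 plus byte count is just one possible choice of concatenated "hashes".

```python
import zlib
from collections import defaultdict
from typing import Dict, List


def find_exact_duplicates(pages: Dict[str, bytes]) -> List[List[str]]:
    """Group URLs whose page bodies are byte-for-byte identical.

    `pages` maps URL -> raw page body (hypothetical stand-in for a crawl).
    """
    # Step 1: as each page is "indexed", record a cheap composite key.
    buckets = defaultdict(list)
    for url, body in pages.items():
        key = (zlib.crc32(body), len(body))  # CRC and byte count combined
        buckets[key].append(url)

    groups: List[List[str]] = []
    # Step 2: walk the keys in sorted order; only pages sharing a key
    # ever need to be compared with each other.
    for key in sorted(buckets):
        urls = buckets[key]
        if len(urls) < 2:
            continue
        # Step 3: confirm with a full byte compare, since keys can collide.
        confirmed: List[List[str]] = []
        for url in urls:
            for group in confirmed:
                if pages[group[0]] == pages[url]:
                    group.append(url)
                    break
            else:
                confirmed.append([url])
        groups.extend(g for g in confirmed if len(g) > 1)
    return groups
```

The expensive byte-by-byte compare only runs inside a bucket, so the total work grows with the number of pages plus the size of the (usually tiny) colliding groups, not with the square of the index size.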

Checksums and CRCs (and codes computed using other similar techniques) are commonly used for error
detection and correction in computer data memory, transmission, and storage systems. They can be
viewed as highly-compressed versions of larger data collections. Thus, they can be used to reduce
the job of comparing every page on the Web to every other page into a manageable task by breaking
the task down into smaller pieces: two pages with the same checksums/CRCs may not be identical, but
two pages with different checksums/CRCs are certainly not identical, and thus need not be compared.

Many duplicate file locater programs intended for cleaning up hard drives simply use the byte count
of the files as a sort index, and then do the byte compare. No, it's not lightning-fast, but it is
not slow on a reasonably-current machine. The speed of the hard drive is the limiting factor in all

The above example addresses exact duplicates, as would be found with multiple domains pointed to
the same server or exactly-mirrored sites. Detecting them is simple using checksums, CRC, and byte
count - all easily computed on-the-fly as each page is spidered. Other, more advanced techniques
based on a concept called "Hamming distance" can also be used to find "near matches" but require
comparatively more computing power.
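
As a rough illustration of the "near match" idea: one well-known way to get fingerprints whose Hamming distance tracks textual similarity is a simhash-style scheme. The post does not say which method is meant, so treat this as an assumed example; the 64-bit size and the use of MD5 as the per-word hash are arbitrary choices for the sketch.

```python
import hashlib

FINGERPRINT_BITS = 64  # arbitrary size chosen for this sketch


def simhash(text: str) -> int:
    """Fold per-word hashes into one fingerprint (simhash-style sketch)."""
    counts = [0] * FINGERPRINT_BITS
    for word in text.lower().split():
        digest = hashlib.md5(word.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # first 64 bits of the hash
        for i in range(FINGERPRINT_BITS):
            # Each word votes +1 or -1 on every bit position.
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Identical texts give a distance of 0, and pages sharing most of their words tend to land only a few bits apart, so a small distance threshold can flag candidates for a closer look. Finding all pairs under that threshold across a large index is the part that takes the extra computing power.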

Good post! I just don't want members here to think that finding duplicates is an impossible task,
and lose their PR because of temptation and a false sense of security.



 7:57 am on Aug 2, 2002 (gmt 0)

Good post! It addresses a pertinent issue.

My websites all work without the www prefix and on two of them the backward links for www.mydomain.com and mydomain.com are absolutely identical, and include links to www.mydomain.com and mydomain.com in both listings. As far as I can tell, Google has merged the two names and treats them as one, as one would hope it would.

After reading JDMorgan's post, my guess is that the checksums of both the www and the non-www pages are compared to each other, and if they are the same Google treats them as the same page.

So, you may be quite correct in your original supposition - that is, duplicate content may not be worth worrying about, at least in the case of www vs no www.


 8:45 am on Aug 2, 2002 (gmt 0)

Terrific post Richard, thank you! Also great follow-ups. :)

There have to be two dozen threads around questioning the issue of multiple domains and duplicate content. It's something that should be addressed as thoroughly as possible to cover all the issues. I posted one myself that's got me concerned:


That's about having articles distributed to several sites, and in view of the one with lower PR being excluded and the higher included, that can be a problem if one is far down in the linking structure.

Personally, I haven't done dual or multiple domains to the same site, but have had time-consuming hassles with people (a couple of which I didn't take on as clients) wanting to point several domain names to the same site - and those would not even have been advantageous.

Those were not hosted with 301s - they were just pointed, as in one case last month that took about 50 long emails to sort out. One site, two domain names, both submitted by her to Inktomi. It changed to 2 separate sites. Another one I'm most likely not doing. One site, 3 domain names pointed, one of which will trip the filter - and also is supposed to target children's-type products.

If people won't do separate sites, and in my experience they fight it, even fighting separate hosting - they just want to point extra domains - I don't think it's worth my effort or time. I do missionary_seo, very much on the straight and narrow, very safe, taking no chances.

In other words, set up a redirection (a 301 status code) which simply tells the browser "this page has moved, proceed to this page, and the move is permanent." This tells the spider about the redirection with no possibility of misunderstanding, yet allows for the multiple domains.

Does that take hosting for the additional domain names, or just "pointing" them, where there is no separate .htaccess, no separate robots.txt? There is no control over the servers, it's just "rented" virtual hosting with one IP number, in some cases it can be a shared IP. People are fighting against taking hosting for more than one. They want one account with additional domains.

I'm finding it very surprising how many people are into multiple domain names, and there are just too many unanswered questions for the different scenarios for my comfort at this point.


 9:55 am on Aug 2, 2002 (gmt 0)

Take a look at my profile URL. Now, search for the exact address (keep the .org at the end) and you will see a site listed.

However, this isn't my site. Someone has cached my entire front page and now Google has dropped my site because it thinks it's a duplicate!

Annoying beyond belief.


 12:24 pm on Aug 2, 2002 (gmt 0)

Although this was one of the two common methods of spam a few years ago, I think that it can be a mistake to think of duplicate problems as penalties (I'm not saying that anyone here has said that, but it often comes up).

As a searcher, I don't want to visit the same page several times at different addresses. It doesn't matter to me whether the owner wants different domains for regional search engines, TV adverts, brands they've bought, mirroring or because they want to be found with or without the www; it's just not useful to get the same page again, and again. Although most of us probably do, a lot of people who use search engines don't realise that the results are identical until they click.

As mortalfrog points out, merging the results is good for the site owner (unless [s]he is trying to spam) because one good listing is better than two poor listings for the same phrase.

Jim's nice description of ordered hashes covers identical pages; what I find impressive is that Google catches near duplicates as well. To do this efficiently requires some clever programming.


 2:55 pm on Aug 2, 2002 (gmt 0)


Does that take hosting for the additional domain names, or just "pointing" them, where there is no
separate .htaccess, no separate robots.txt? There is no control over the servers, it's just "rented"
virtual hosting with one IP number, in some cases it can be a shared IP. People are fighting against
taking hosting for more than one. They want one account with additional domains.

No extra hosting is needed; consolidating multiple domains and their PageRank can easily be done
under the circumstances you specified above. There is some fairly recent discussion here [webmasterworld.com]. However, this method
(using a 301 redirect) will likely cause all of the duplicate domains to disappear from the index, leaving
all of their PR concentrated on the one "standard" domain you choose to keep. That's what you want
if you are doing it "clean", but probably not what the client with 100 purchased keyword-domain-name
variants intended when he/she paid for all those domain names.



 3:34 pm on Aug 2, 2002 (gmt 0)

Great posts :)

Whilst I agree with Rich that there are a number of good reasons for displaying duplicate content, if the search engines didn't have a method of eliminating very similar pages, their results would be flooded with similar pages taking up top positions and a first page result set would be taken up by one network of very similar or identical pages.

At some stage, almost all of the search engines experienced this in their development. This was extremely prevalent in Excite and Infoseek (although very different engines), and Google, AlltheWeb and AltaVista all suffered from displaying multiple results of similar or identical pages. This definitely compromised their value offering and they quickly moved to reduce it with various mechanisms.

In the last six months, we have built and introduced 12 sites for clients (strong SEO focus). Small numbers by comparison to most of you. Almost all hit the mark after the usual lag (submissions to Yahoo!, ODP and LookSmart before pay for play), aggressive reciprocal linking and then big crawls and subsequent listings.

One missed the boat entirely. After seven months, not one page has been indexed by Google (Googlebot hits the site hard every month). The site does not differ from the others that we have built and, importantly, has a number of great links pointing to it (including all the directory listings necessary).

The client had 3 domain names and wanted us to alias 2 of them to point to the new one. In the first month, Googlebot ran through all 3 domains (by following links from prominent sites) and continued to do so in the next 3 months without listing any of the sites' pages. We removed the aliasing in month 4 and have diligently been gaining reciprocal links from qualified sites.

I am convinced that we have been nailed by a duplicate content filter. Please shout if you feel otherwise and importantly, if you have suggestions of how to get out of this mess.


 4:19 pm on Aug 2, 2002 (gmt 0)

Great post Rich.

I agree with Jim regarding the notion that identifying duplicate content is a difficult task, although I think Google's method of detection is more along the lines of the system AltaVista received a patent for [].

A combination of file name and link structure analysis would be enough to catch most dupe and near-dupe content.

I also think that Google now understands most of the points in Rich's post, and has backed off on actually penalizing sites that turn up as duplicates. The only thing that seems to be happening now is that the PR/inbound links are being merged, and only one version is being shown.

As an example, in June, I came across a domain name that was once owned by the manufacturer of a specific product. Having a client that operated a site about that product, I bought the domain and put up a duplicate version of the client site on a separate IP.

The original site had a PR of 6, and the new domain had an existing PR of 5 due to all the links that still exist.

After the last crawl, our positions for the original site were replaced by the new domain. Both sites now show a PR 6, and doing a check for backlinks on both domains returns an identical list of sites (even though the two sites resolve to different IPs).

Although the original site still displays a solid PR6 in the Toolbar, it is completely nonexistent in any related SERPs.


 4:54 pm on Aug 2, 2002 (gmt 0)

As a searcher, I don't want to visit the same page several times at different addresses

Of course, that's the reason for the 301 status code. It tells the search engine to remove it so it's not added to the index.

The duplicate domain names are great for other types of marketing campaigns. The other domain names are not intended for search engine marketing, as it's best to use just a single domain in those instances.

The job of finding exact page duplicates is actually rather easy, and not as big a task as might be imagined

I understand. I think it's just good to be sure any alternate domains use the 301 redirect. That should eliminate confusion and problems.

Does that take hosting for the additional domain names

I believe this could be handled in an .htaccess file. The domain name is passed in the HTTP Host header. Any .htaccess experts care to comment on how to check this?

On NT and 2000 it's trivial.

By the way, great responses. Thanks.

Richard Lowe


 6:27 pm on Aug 2, 2002 (gmt 0)


Add to .htaccess:

# Enable mod_rewrite (only needed once, if not already enabled)
RewriteEngine On
# Redirect non-standard incoming domains to www.mydomain.tld
RewriteCond %{HTTP_HOST} !^www\.mydomain\.tld$
RewriteRule ^(.*)$ [mydomain.tld...] [L,R=permanent]
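
For anyone not on Apache, the decision the rule above makes can be sketched in a few lines. This is an illustration under assumptions, not a translation of the exact directives: `www.mydomain.tld` is the placeholder from the rule, the `http://` scheme is assumed, and unlike the RewriteCond as written, this sketch ignores case and any port in the Host header.

```python
from typing import Optional

CANONICAL_HOST = "www.mydomain.tld"  # placeholder name taken from the rule above


def canonical_redirect(host_header: str, path: str) -> Optional[str]:
    """Return the URL to 301 to, or None if the host is already canonical."""
    # Drop an optional ":port" and compare case-insensitively (an assumption;
    # the RewriteCond above does neither).
    host = host_header.split(":", 1)[0].strip().lower()
    if host == CANONICAL_HOST:
        return None  # already on the standard domain, serve the page normally
    if not path.startswith("/"):
        path = "/" + path
    # Assumed scheme: plain http.
    return f"http://{CANONICAL_HOST}{path}"
```

A request arriving with any other Host value gets a permanent redirect to the same path on the standard domain, which is what lets several "pointed" domains share one hosting account while only one shows up in the index.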

