I am generating a number of cross-reference pages from a 'product' database, and this results in pages that are almost identical, because the products themselves are already very similar.
Maybe GoogleGuy can help
If, for example, only a single word is changed, the page is no longer identical, just very, very similar, but I bet Google still treats it as identical. How much needs to change before Google treats it as a page in its own right and not a duplicate?
Between identical and similar there is a grey area to cross, but who knows how wide that grey area is?
So similarity (about 80%) works for me, but if you duplicate 1:1, you could be in trouble....
I seem to recall reading that these do not draw a penalty simply because they're wrapped in different designs and navigation schemes.
It would certainly help if we could gain greater clarity concerning what is and what isn't regarded as page duplication.
It sometimes seems that, in our obsession with Google's rules, we get too caught up in the theoretical and ignore what we actually see in front of us every day.
Most times when I search on Google I find duplicate content. And, depending on the topic, it can be duplicated on quite highly-placed pages. Whether reprints of articles, extracts of dissertations, or historical documents, it's all there plain to see. The snippets are often much the same.
Just did a search for a popular market newsletter by a well-known web positioning company (this ain't no plug). Most of the pages returned are identical or real close to it. And I found the first PR0 result somewhere around page 40 of the SERPs.
Now, I'm not saying that Google doesn't penalize for duplicate content, but they don't appear to penalize ALL duplicate content.
Don't know what the threshold is: almost-identical doorway pages on the same site; identical pages on related sites (interlinked? same server?); one identical page on multiple domains might be okay, but two aren't? Who the heck knows?
I'm in the process of designing a site optimized for Google and -- no matter what I say above -- though the underlying content is the same, it won't read the same or look the same.
Jim
Nope.. the thread was posted on another site entirely....
"Soapystar so tell us how good we are, were the answers the same on the other forum?"
this forum kinda beats the other thread out of sight.... ;)
BTW, someone did post some info about search engines looking for a minimum of 8-13% difference between pages...
I am guessing that Google must be a little bit smarter than just counting word frequencies, and also takes into account structural similarities. There is info available on the web about page-similarity algorithms, and I was hoping someone would have converted one of them into a tool for all of us to use.
But since we don't know which algorithm Google uses, nor the magic similarity percentage that triggers a penalty, such a tool would be of limited use anyway.
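For anyone curious what such a check might look like, here is a rough sketch of one well-known approach from the literature: w-shingling compared with the Jaccard coefficient. To be clear, this is just an illustration of the general technique, not Google's actual algorithm (nobody outside Google knows what they use), and the 0.80 threshold is only the ~80% figure floated earlier in this thread:

    import re

    def shingles(text, w=4):
        # Normalise: lowercase and keep only word characters.
        words = re.findall(r"[a-z0-9]+", text.lower())
        # Each run of w consecutive words is one shingle.
        return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

    def similarity(text_a, text_b, w=4):
        # Jaccard coefficient: shared shingles / total distinct shingles.
        a, b = shingles(text_a, w), shingles(text_b, w)
        if not a or not b:
            return 0.0
        return len(a & b) / len(a | b)

    page_a = "The red widget is a sturdy widget, ideal for home use."
    page_b = "The blue widget is a sturdy widget, ideal for home use."
    score = similarity(page_a, page_b)
    print(score, "near-duplicate" if score > 0.80 else "distinct")

One practical note: on a real site you would want to strip the shared template first, otherwise the common navigation and boilerplate would dominate the score.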
Near duplication can cause problems.
A year ago, Google definitely rolled out an overly aggressive duplication filter, but I haven't seen any signs over the last 8 or 9 months that that filter is still being used.
There is simply too much naturally occurring duplicate content on the web.
However, going after near duplication is a different story. A large group of pages that are very close to being identical probably look that way because someone has intentionally altered them so that they wouldn't be exact duplicates. And those are the kinds of actions that generally contribute to a poor search experience.
So, you should make sure that your pages are either exact duplicates or significantly different.
Also, did GoogleGuy mention any percentages when talking about 'slightly' or 'significantly' different?
Thanks again,
S.
That being the case, it is quite common for Googlebot to stumble across some of these additional domains. That causes them to reindex a site that they've already crawled under a different domain.
>>Also, did GoogleGuy mention any percentages when talking about 'slightly' or 'significantly' different?
Of course not. :)
I think that what Google and other SEs would be looking for is not duplicate pages between or among websites; as WebGuerrilla states, much of that occurs naturally (syndicated reports, affiliate product descriptions, press releases, etc.).
What they are most probably after are the "doorway" pages that are essentially the same with just a tweak here or there to appeal to different SEs or target slightly different keyword phrases. There was a huge push on this a few years back when SEO wannabes were pumping out 10s & 100s of these pages per site.
Look at your pages. If they look, feel or smell spammy, well, then they might pose a problem. If they serve a legitimate purpose you're probably okay.
And again, just in your normal searches of Google, notice how many dupe pages are indexed and rank quite highly.
Jim
Let's say I have some pages, <page.html>, <page1.html>, etc., where people register for my services. The trouble is that lots of people find registration confusing. They lose track of where they are.
So I am trying to create an alternative experience for my visitors whereby they can choose to be led through a series of interconnected pages with special navigation guides. The material on these interconnected pages is identical to <page.html>, <page1.html>, except for the navigation aids. (They might be named <pagea.html>, <pagea1.html>, etc.)
These pages would all exist on the same server, in the same root directory. Is this the sort of "duplication" that I could be penalized for?