Dupe Content Detection

milez

12:35 pm on Dec 28, 2003 (gmt 0)

Hello WW.

Does anyone have a clue about *how* Google decides which pages are similar to others?

Does it go all the way and checksum the text itself, or does it just stay at the page level and compare the file names/checksums/size of the page as a whole?

Pavel.

ciml

2:28 pm on Dec 29, 2003 (gmt 0)

Hello pavel. Google will merge two URLs if they are found to be identical, removing one URL and crediting the other with both sets of backlinks and the PR from both. Google also omits similar pages, but you can see them if you click the 'repeat the search with the omitted results included' link or append &filter=0 to the results URL.

wanna_learn

6:37 pm on Dec 29, 2003 (gmt 0)

"removing one URL and crediting the other with both "
How is it decided which one to remove and which one to credit?
That's the soul of this issue.

hobbnet

6:40 pm on Dec 29, 2003 (gmt 0)

I believe the page with lower PR is removed.

wanna_learn

6:46 pm on Dec 29, 2003 (gmt 0)

"I believe the page with lower PR is removed."
That would be a blunder!

Especially since webmasters who put their effort into writing content instead of swapping links would be removed when their quality content is copied by another site with high PR.

WebmasterFisherman

7:17 pm on Dec 29, 2003 (gmt 0)

Doesn't it consider such things *duplicate content* and just penalize your site (page) if it feels like a bit of bloodshed?

I mean if 2 urls give you *exactly* the same content...

Kirby

7:40 pm on Dec 29, 2003 (gmt 0)

>Google also omits similar pages,

How does it define 'similar'? When does 'similar' become duplicate content?

walkman

8:16 pm on Dec 29, 2003 (gmt 0)

"How does it define 'similar'? When does 'similar' become duplicate content"

I would like to know this one too. I have many pages that show each merchant separately. The title, description, keywords and body are the same, but the name changes depending on the merchant.

For example: Amazon.com blah blah, Amazon.com blah
and Buy.com blah blah, Buy.com blah.

In some cases I have two links/pages, one for Amazon and one for amazon.com (the .com difference is reflected in the title, desc, keywords and body as well; about 10-15 mentions across all 4 places).

Can I improve things by doing it differently?

WebmasterFisherman

8:36 pm on Dec 29, 2003 (gmt 0)

Can I improve things by doing it differently?

If it works, then what's the point in improving?

As far as I can see, 'blah' in your case is the same product; it's just described on different pages.

What you are doing is a very common way of duplicating content.

But it works with Amazon products, as I know from the experience of one of my competitors.

dvduval

8:42 pm on Dec 29, 2003 (gmt 0)

We may be talking about two different things:
1) Duplicate content on the same site
2) Duplicate content on different sites

When Google removes the lower ranking page, that would be on the same site.

walkman

10:40 pm on Dec 29, 2003 (gmt 0)

Hi, thanks for the reply

> If it works, then what's the point in improving?
To improve and do better :), get more visitors/money.

> As far as I see 'blah' in your case is the same product, it's just that it's described on different pages
Actually it's more like saying amazon.com/buy.com/storename.com offers, discounts etc., not products (not that it makes a big difference).

DVDBurning

11:59 pm on Dec 29, 2003 (gmt 0)

Personally, I'm not sure Google has much of a "unique content" or "duplicate site" filter in place. I had my entire website copied, and I only found it by noticing that the offending site ranked higher than mine for nearly every search term. I was still ranked, right below my stolen pages. This was more than one month ago, and I submitted SPAM and copyright complaints to Google. I finally got the site offline, by going to the web hosting company... but Google still has all of these copied pages in the index.

We are talking about over 400 pages of completely identical content... including graphics. Only the links were changed to point to the new domain.

antrat

12:01 am on Dec 30, 2003 (gmt 0)

milez, I doubt anyone here knows the answer to your question. As you have seen, you will get everything from educated guesses through to stating the obvious. It's probably one of Google's closely guarded secrets.

Personally I believe that Google is not too concerned with duplicate content, unless near-identical pages are on the same page of the SERPs, and even then... Google is very adamant about loading quickly, and I would say that any duplicate content check would slow things down quite a bit.

zgb999

10:22 am on Dec 30, 2003 (gmt 0)

Google is very much concerned with duplicate content. Otherwise they would have several times as many pages in their index as they do.

In addition to that they are concerned with similar content (mainly on the same domain). What we have seen is that they even got stricter on how close a similar page may be to another page.

We found many pages with only the title and the URL in the SERP that have been fully indexed before. If we have 15 pages about different kinds of widgets it indexes one of them fully and the other 14 only show title/URL.

Part of all 15 pages is the same content and this behaviour of filtering out similar pages is there on different domains.

tribal

10:42 am on Dec 30, 2003 (gmt 0)

I agree with DVDBurning, it doesn't really check duplicate content. However, I suspect the topicstarter means two identical sites, with different domain names, right?

If so, I believe G uses the "related:" command to identify related pages with (almost) identical content.

wanna_learn

7:23 pm on Dec 30, 2003 (gmt 0)

Can any senior member shed some light on this topic?

Yidaki

7:43 pm on Dec 30, 2003 (gmt 0)

The original question was about the HOW:

Does anyone have a clue about *how* Google decides which pages are similar to others?

I don't know how google's dup filter works but found some very interesting threads:

- Duplicates and the challenges search engines face [webmasterworld.com]
Starting point for understanding how duplication is detected. - May, 2003

- How does Google determine dup. cont. anyway? [webmasterworld.com]
I see three or four copies of one page from my site . . . - June, 2003

- Google and duplicate Content [webmasterworld.com]
A few comments about duplicate content and google - August, 2002

BigDave

8:16 pm on Dec 30, 2003 (gmt 0)

I don't know how the filter works. But I do suspect that they have several different ones.

There is a search-time filter that tries to avoid having the same page on different sites filling the SERPs. We know that this one is there, because GG has told us how to override it. (I don't recall how, because I don't have any duplicate content to worry about.)

Then there seems to be a filter that goes and tries to find garbage pages (often duplicates) in link farms.

What they absolutely do not do is compare every page on the web with every other page. That would require

n = number of pages in the index
n * ((n-1)/2) page comparisons.

If they have 5 billion pages in the index, this would require 12,499,999,997,500,000,000 page comparisons. That simply ain't happening.
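
For anyone who wants to check that figure, here is a minimal Python sketch of the arithmetic. The function name is made up for illustration; the 5 billion index size is the assumption from the post above.

```python
def pairwise_comparisons(n):
    """Number of unordered page pairs among n pages: n * (n - 1) / 2."""
    return n * (n - 1) // 2


index_size = 5_000_000_000  # index size assumed in the post
total = pairwise_comparisons(index_size)
print(f"{total:,}")  # 12,499,999,997,500,000,000
```

The count grows with the square of the index size, which is why an all-pairs comparison is not a plausible design.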

BigDave

8:18 pm on Dec 30, 2003 (gmt 0)

Oh yeah, I should add that they might do something like an MD5 on all the pages and sort on that, then compare only those that have the same MD5, but then they will only remove those that are 100% matches.
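
A rough sketch of that hash-and-sort idea in Python, purely as an illustration and not a claim about what Google actually runs. The URLs, page bodies and function names are hypothetical.

```python
import hashlib
from itertools import groupby


def md5_of(body):
    """Hex MD5 digest of a page body."""
    return hashlib.md5(body.encode("utf-8")).hexdigest()


def group_exact_duplicates(pages):
    """Sort pages by digest, then collect runs that share the same digest."""
    keyed = sorted((md5_of(body), url) for url, body in pages.items())
    groups = []
    for _, run in groupby(keyed, key=lambda pair: pair[0]):
        urls = [url for _, url in run]
        if len(urls) > 1:  # a digest seen only once has no duplicate
            groups.append(urls)
    return groups


# Hypothetical sample pages; the third differs by a single character.
pages = {
    "a.example/x.htm": "<p>Blue widgets for sale</p>",
    "b.example/y.htm": "<p>Blue widgets for sale</p>",
    "c.example/z.htm": "<p>Blue widgets for sale!</p>",
}
duplicate_groups = group_exact_duplicates(pages)
```

Sorting costs roughly n log n instead of n squared, but as the post says, a single changed character gives a different digest, so only 100% matches are caught.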

wanna_learn

8:47 pm on Dec 30, 2003 (gmt 0)

How come the duplicate penalty is talked about so much then?

I guess it would be a penalty for EXACT MIRRORED pages, without a single word of text different.

Whatever the original question was, people are much more interested in knowing: if dupe content is caught, which page/site will be kicked out?

One more thing...
If site A, with 100 pages, has 5 pages just like site B's, and Google sees this, would the whole site suffer a dupe content penalty?

BigDave

9:09 pm on Dec 30, 2003 (gmt 0)

I know that some of you will argue this point with me, you always do.

But there is a difference between a filter and a penalty.

You may feel like you are being penalized by a filter, but you are just being removed from that specific result. A penalty will hurt you (but not necessarily remove you) across all results.

If you do not understand the difference, you will be ill-equipped to deal with it.

Just because google does not compare every page to every other page on the entire web, does not mean that they do not compare pages with a high probability of being similar.

Remember, Google has all your linking information. Links can tell a lot about the content, and content can tell a lot about the links.

If there are 3000 pages all pointing to one page with the same anchor text, they might run their diff program on those pages.

In fact, they do not have to compare them all with each other. They can run through a few pages, and if it appears to be clean, then look elsewhere. If it is a bunch of cloned pages they will be removing duplicate content from the 3000 pages as they go. If all of them are duplicate, they only need to do 2999 compares to remove 2999 pages.
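
The 2999 figure can be illustrated with a small Python sketch. This is a guess at the shape of such a pass, not anything confirmed: plain string equality stands in for the "diff program", and the function name is hypothetical.

```python
def dedupe_candidates(bodies):
    """Keep one copy of each distinct body and count the comparisons made."""
    kept = []      # one representative per distinct body
    removed = 0    # pages dropped as duplicates
    compares = 0   # page-to-page comparisons performed
    for body in bodies:
        for representative in kept:
            compares += 1
            if body == representative:  # equality stands in for a real diff
                removed += 1
                break
        else:
            # No representative matched, so this page is new content.
            kept.append(body)
    return kept, removed, compares


# 3000 cloned pages: the first is kept, each of the rest needs one compare.
kept, removed, compares = dedupe_candidates(["same page"] * 3000)
```

With 3000 identical pages this ends with one page kept, 2999 removed and 2999 comparisons. The cost only grows when the candidate set contains many distinct pages.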

As to which ones they filter out of the results, I would bet on the one that was crawled first in the current index is the one that stays.

Google has no quick easy fair way to tell who "deserves" to be there, so they might as well go with the easiest answer for them.

wanna_learn

9:22 pm on Dec 30, 2003 (gmt 0)

In the first place, I was referring to a penalty and not a filter.
I have an example where 2 sites on the same IP had 15-20% SAME-content pages, and that triggered a penalty (not a filter).

"As to which ones they filter out of the results, I would bet on the one that was crawled first in the current index is the one that stays."
>>> Suppose I have a site with only 1 page, x.htm, which was crawled 4 years back. I copy the exact content from another 1-page site, y.htm, which was crawled a few days back.

Google catches the "duplicate content"...
Would the new site be kicked out because it was crawled later?

BigDave

9:28 pm on Dec 30, 2003 (gmt 0)

The crawl from 4 years back would not be in the current index. The current index is made from recent crawls by googlebot.

On the other hand, a crawl from just a few days ago is only likely to be included as a "fresh" page, and fresh pages are stored differently and are (most likely) merged into the results at search time.

You seem sure that it is a penalty, and not a filter. Why do you believe this?

wanna_learn

9:34 pm on Dec 30, 2003 (gmt 0)

BigDave,
Because the site dropped from PR 5 to 0 and also disappeared from the SERPs entirely.
(I wonder if that was a bad neighbourhood penalty, but I'm more sure that it got caught for dupe content.)

BigDave

9:48 pm on Dec 30, 2003 (gmt 0)

Are you just gone from the SERPs or are you gone from the index?

You also don't give any reason that you suspect that it is the dup content.

When I checked your profile to see if your home page was listed, you put LOL-that-would-be-a-disaster which suggests to me that you know that you are quite possibly crossing several lines. You also mention that you might be part of a bad neighborhood.

I just end up with this hunch that dup content, while part of your problem, is far from the only reason your site is sitting at PR0. With all that going on, you might have even failed a manual check of your site.

milez

10:38 pm on Dec 30, 2003 (gmt 0)

BigDave said:

"Oh yeah, I should add that they might do something like an MD5 on all the pages and sort on that, then compare only those that have the same MD5, but then they will only remove those that are 100% matches"

I think that a variation of this might be correct:
Parse the DOM, concatenate the actual heavy bits of content (paragraphs?) together, checksum them, and then sort.
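
To make that variation concrete, here is a hedged Python sketch. It assumes the "heavy bits" are the text inside `<p>` tags and uses the standard library's streaming `HTMLParser` rather than a full DOM; the class name, function name and sample pages are all hypothetical.

```python
import hashlib
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collects only the text that appears inside <p> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # how many <p> elements are currently open
        self.chunks = []  # text fragments found inside paragraphs

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def content_checksum(html):
    """MD5 of the concatenated, whitespace-normalized paragraph text."""
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    text = " ".join(" ".join(parser.chunks).split()).lower()
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# Same paragraph, different titles, headings and navigation.
page_a = "<html><head><title>Site A</title></head><body><h1>Site A</h1><p>Blue widgets  are great.</p></body></html>"
page_b = "<html><head><title>Site B</title></head><body><div>nav</div><p>Blue widgets are great.</p></body></html>"
page_c = "<html><body><p>Red widgets are great.</p></body></html>"
```

Here `page_a` and `page_b` get the same checksum even though their titles and navigation differ, while `page_c` does not. The sketch assumes every `<p>` is properly closed, and a one-word change in the body text still defeats it.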

Thanks for all the guesses, by the way - the more and the less educated ones..

DVDBurning

4:00 am on Dec 31, 2003 (gmt 0)

I think duplicate pages that are on the same site would be easiest to find, followed by pages that are on sites that link to one another. However, duplicate page content will be optimized for the same search phrases, so they show up in the same SERPS.

What mystifies me is why duplicate pages with identical titles, meta descriptions, and keywords appear right above each other in the SERPS.

What annoys me is that the guy who stole and copied my entire site not only appears above me in all the SERPS, but caused my site to be penalized for duplicate content (PR dropped from 6 to 3). I guess I'll have to wait for the next update to recover... even though the offending site is now shut down... Google won't pull these copied pages out of the SERPS for a month.

In terms of figuring out which page was the original, and which page was the duplicate, you might think that Google could do a little better, since most of these pages had not changed in a couple of months, and Google still had them in their cache. It would certainly be better if someone copies your pages that THEY get penalized for duplicate content... not you, the author.

tribal

10:09 am on Dec 31, 2003 (gmt 0)

DVDBurning: try to do a related search on their site. If your site or sites linking to the penalized site come up in the SERPs I'd be almost sure this is where your problem lies.

I've been researching the duplicate content thing as well and came to the conclusion G had to use the related search feature seeking out duped pages. It is also possible they identify duped content dynamically, e.g. take the first 1000 results from the SERPs, then filter them again for duped content.
Although we'll never have a sample large enough to obtain absolute certainty, I'm pretty sure it comes down to this.

Nova Reticulis

11:06 am on Dec 31, 2003 (gmt 0)

I don't know how Google does it but here is how I would do it:

Stage 1 - normalize HTML, hash it and compare. Identical pages are found immediately.

Stage 2 - classic udiff comparison; a difference below a pre-defined threshold signifies near-identical pages.

Stage 3 - parse HTML into a DOM tree, normalize it, use some kind of fuzzy logic to cross-relate the elements and compare the tree structure and the content of elements.

Result is a qualifier that mathematically describes the likelihood of two pages being content-identical.
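
A minimal Python sketch of the hash step and the diff step, under loud assumptions: tags are stripped with a crude regex, `difflib.SequenceMatcher` stands in for udiff, and the 0.8 threshold is invented for illustration. The DOM-tree stage is left out.

```python
import difflib
import hashlib
import re


def normalize(html):
    """Crude normalization: drop tags, lowercase, collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", html)
    return " ".join(text.lower().split())


def compare_pages(html_a, html_b, threshold=0.8):
    """Return (label, score), where score is a 0..1 similarity ratio."""
    a, b = normalize(html_a), normalize(html_b)
    # Hash step: identical normalized text is caught immediately.
    if hashlib.md5(a.encode("utf-8")).digest() == hashlib.md5(b.encode("utf-8")).digest():
        return "identical", 1.0
    # Diff step: word-level similarity; the threshold is an arbitrary assumption.
    ratio = difflib.SequenceMatcher(None, a.split(), b.split()).ratio()
    label = "near-duplicate" if ratio >= threshold else "different"
    return label, ratio


same = compare_pages("<p>Blue widgets</p>", "<div>blue   widgets</div>")
close = compare_pages(
    "<p>one two three four five six seven eight nine ten</p>",
    "<p>one two three four five six seven eight nine eleven</p>",
)
far = compare_pages("<p>blue widgets for sale</p>", "<p>completely unrelated text here</p>")
```

The score plays the role of the "qualifier" in the post: a single number describing how likely two pages are to be content-identical.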

steveb

11:46 am on Dec 31, 2003 (gmt 0)

Just noticed something for one keyphrase...

On five datacenters the first result is www.widget.com

On four datacenters (-gv currently offline) the first result is widget.com, with no www.

All the other results (in terms of showing www always or never) are consistent across all datacenters.
