I mean if 2 urls give you *exactly* the same content...
I would like to know this one too. I have many pages that show each merchant separately. The title, description, keywords and body are the same, but the name changes depending on the merchant.
For example: Amazon.com blah blah, Amazon.com blah
and Buy.com blah blah, Buy.com blah.
In some cases I have two links/pages, one for Amazon and one for amazon.com (the .com difference is reflected in the title, desc, keywords and body as well; about 10-15 mentions across all 4 places).
Can I improve things by doing it differently?
> Can I improve things by doing it differently?
If it works, then what's the point in improving?
As far as I can see, 'blah' in your case is the same product; it's just that it's described on different pages.
What you are doing is a very common way of duplicating content.
But it works with Amazon products, as I know from the experience of one of my competitors.
> If it works, then what's the point in improving?
Improve to do better :), get more visitors/money.
> As far as I can see, 'blah' in your case is the same product; it's just that it's described on different pages.
Actually it's more like saying amazon.com / buy.com / storename.com offers, discounts etc., not products (not that it makes a big difference).
We are talking about over 400 pages of completely identical content... including graphics. Only the links were changed to point to the new domain.
Personally I believe that Google is not too concerned with duplicate content, unless near-identical pages are on the same page of the SERPs, and even then... Google is very adamant about loading quickly, and I would say that any duplicate content check would slow things down quite a bit.
In addition to that they are concerned with similar content (mainly on the same domain). What we have seen is that they even got stricter on how close a similar page may be to another page.
We found many pages that now show only the title and the URL in the SERPs, although they had been fully indexed before. If we have 15 pages about different kinds of widgets, it indexes one of them fully and the other 14 only show title/URL.
Part of all 15 pages is the same content and this behaviour of filtering out similar pages is there on different domains.
Does anyone have a clue about *how* Google decides which pages are similar to others?
I don't know how google's dup filter works but found some very interesting threads:
- Duplicates and the challenges search engines face [webmasterworld.com]
Starting point for understanding how duplication is detected. - May, 2003
- How does Google determine dup. cont. anyway? [webmasterworld.com]
I see three or four copies of one page from my site . . . - June, 2003
- Google and duplicate Content [webmasterworld.com]
A few comments about duplicate content and google - August, 2002
There is a search-time filter that tries to avoid having the same page on different sites filling the SERPs. We know that this one is there, because GG has told us how to override it. (I don't recall how, because I don't have any duplicate content to worry about.)
Then there seems to be a filter that goes and tries to find garbage pages (often duplicates) in link farms.
What they absolutely do not do is compare every page on the web with every other page. That would require
n * (n-1) / 2 page comparisons,
where n = number of pages in the index.
If they have 5 billion pages in the index, this would require 12,499,999,997,500,000,000 page comparisons. That simply ain't happening.
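For anyone who wants to check that arithmetic, here is a minimal sketch in Python. The function name `pairwise_comparisons` is just an illustrative label, and the 5 billion figure is the index size assumed in the post above.

```python
def pairwise_comparisons(n: int) -> int:
    """Number of unordered page pairs among n pages: n * (n - 1) / 2."""
    return n * (n - 1) // 2


index_size = 5_000_000_000  # the 5 billion pages assumed above
print(f"{pairwise_comparisons(index_size):,}")
# prints 12,499,999,997,500,000,000
```

Python integers are arbitrary precision, so the result is exact rather than a floating-point approximation.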
I guess it would be a penalty for EXACT MIRRORED Pages, without a single word different (text).
Whatever the original question was, people are much more interested in knowing: "if dupe content is caught, which page/site will be kicked out?"
One more thing...
If site A has 100 pages and 5 of them are just like pages on site B, and Google sees this... would the whole site suffer a dupe content penalty?
But there is a difference between a filter and a penalty.
You may feel like you are being penalized by a filter, but you are just being removed from that specific result. A penalty will hurt you (but not necessarily remove you) across all results.
If you do not understand the difference, you will be ill-equipped to deal with it.
Just because Google does not compare every page to every other page on the entire web, that does not mean they do not compare pages with a high probability of being similar.
Remember, Google has all your linking information. Links can tell a lot about the content, and content can tell a lot about the links.
If there are 3000 pages all pointing to one page with the same anchor text, they might run their diff program on those pages.
In fact, they do not have to compare them all with each other. They can run through a few pages, and if it appears to be clean, then look elsewhere. If it is a bunch of cloned pages, they will be removing duplicate content from the 3000 pages as they go. If all of them are duplicates, they only need to do 2999 compares to remove 2999 pages.
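To make the 2999-compares point concrete, here is a small sketch in Python. It is only a guess at the shape of such a pass: `filter_clones` is a hypothetical name, the example URLs are invented, and plain string equality stands in for whatever "diff program" would really be used.

```python
def filter_clones(pages):
    """Keep the first copy of each distinct page and count comparisons made.

    pages is a list of (url, text) tuples in crawl order.
    """
    kept, removed, compares = [], [], 0
    for url, text in pages:
        duplicate = False
        for _, kept_text in kept:
            compares += 1
            if text == kept_text:  # stand-in for a real diff
                duplicate = True
                break
        if duplicate:
            removed.append(url)
        else:
            kept.append((url, text))
    return kept, removed, compares


# 3000 cloned pages that all carry the same text
clones = [(f"http://example.com/clone{i}.htm", "same text") for i in range(3000)]
kept, removed, compares = filter_clones(clones)
print(len(kept), len(removed), compares)
# prints 1 2999 2999
```

Because every clone matches the first page that was kept, each later page costs exactly one comparison. The cost only grows when the candidate set contains many distinct pages.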
As to which ones they filter out of the results, I would bet on the one that was crawled first in the current index is the one that stays.
Google has no quick easy fair way to tell who "deserves" to be there, so they might as well go with the easiest answer for them.
"As to which ones they filter out of the results, I would bet on the one that was crawled first in the current index is the one that stays."
>>> Suppose I have a one-page site, x.htm, which was crawled 4 years ago. I copy the exact content from another one-page site, y.htm, which was crawled a few days ago.
Google catches the "duplicate content"...
Would the new site be kicked out because it was crawled later?
On the other hand, a crawl from just a few days ago is only likely to be included as a "fresh" page, and fresh pages are stored differently and are (most likely) merged into the results at search time.
You seem sure that it is a penalty, and not a filter. Why do you believe this?
You also don't give any reason that you suspect that it is the dup content.
When I checked your profile to see if your home page was listed, you put LOL-that-would-be-a-disaster which suggests to me that you know that you are quite possibly crossing several lines. You also mention that you might be part of a bad neighborhood.
I just end up with this hunch that dup content, while part of your problem, is far from the only reason your site is sitting at PR0. With all that going on, you might have even failed a manual check of your site.
"Oh yeah, I should add that they might do something like an MD5 on all the pages and sort on that, then compare only those that have the same MD5, but then they will only remove those that are 100% matches"
I think that a variation of this might be correct:
Parse the DOM, concatenate the actual heavy bits of content (paragraphs?) together, checksum them, and then sort.
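A rough sketch of that idea in Python, using only the standard library. The names `ParagraphText`, `content_checksum` and `group_by_checksum` and the example URLs are made up for illustration, and treating `<p>` text as the "heavy bits" is an assumption; nobody outside Google knows what they actually hash.

```python
import hashlib
from collections import defaultdict
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collects text inside <p> elements and ignores everything else.

    Simplification: assumes every <p> is explicitly closed.
    """

    def __init__(self):
        super().__init__()
        self.depth = 0  # number of <p> elements currently open
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def content_checksum(html: str) -> str:
    """MD5 of the paragraph text, lowercased with whitespace collapsed."""
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    text = " ".join(" ".join(parser.chunks).split()).lower()
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def group_by_checksum(pages: dict) -> list:
    """Return lists of URLs whose paragraph text hashes to the same value."""
    groups = defaultdict(list)
    for url, html in pages.items():
        groups[content_checksum(html)].append(url)
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]


pages = {
    "a.example/x.htm": "<html><title>Amazon</title><body><h1>Amazon</h1><p>Widgets on  sale.</p></body></html>",
    "b.example/y.htm": "<html><title>Buy</title><body><h1>Buy.com</h1><p>widgets on sale.</p></body></html>",
    "c.example/z.htm": "<html><body><p>Something else.</p></body></html>",
}
print(group_by_checksum(pages))
# prints [['a.example/x.htm', 'b.example/y.htm']]
```

Grouping by checksum is what replaces the "sort" step: pages only need a closer look when they land in the same bucket. Like the plain MD5 approach, this still only catches pages whose extracted text matches exactly.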
Thanks for all the guesses, by the way - the more and the less educated ones..
What mystifies me is why duplicate pages with identical titles, meta descriptions, and keywords appear right above each other in the SERPS.
What annoys me is that the guy who stole and copied my entire site not only appears above me in all the SERPS, but caused my site to be penalized for duplicate content (PR dropped from 6 to 3). I guess I'll have to wait for the next update to recover... even though the offending site is now shut down... Google won't pull these copied pages out of the SERPS for a month.
In terms of figuring out which page was the original, and which page was the duplicate, you might think that Google could do a little better, since most of these pages had not changed in a couple of months, and Google still had them in their cache. It would certainly be better if someone copies your pages that THEY get penalized for duplicate content... not you, the author.
I've been researching the duplicate content thing as well and came to the conclusion G had to use the related search feature seeking out duped pages. It is also possible they identify duped content dynamically, e.g. take the first 1000 results from the SERPs, then filter them again for duped content.
Although we'll never have a sample large enough to obtain absolute certainty, I'm pretty sure it comes down to this.
Stage 1 - normalize HTML, hash it and compare. Identical pages are found immediately.
Stage 2 - classic udiff comparison; a difference below a pre-defined threshold signifies effectively identical pages.
Stage 3 - parse HTML into a DOM tree, normalize it, use some kind of fuzzy logic to cross-relate the elements and compare the tree structure and the content of elements.
Result is a qualifier that mathematically describes the likelihood of two pages being content-identical.
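Here is a minimal sketch of the first two stages in Python, to show what such a score could look like. Everything in it is an assumption for illustration: the names `normalize`, `likeness` and `is_duplicate`, the 0.9 threshold, the regex-based tag stripping, and the use of `difflib.SequenceMatcher` in place of a real udiff. The DOM-tree stage is left out.

```python
import difflib
import hashlib
import re

SIMILARITY_THRESHOLD = 0.9  # hypothetical cut-off, chosen only for illustration


def normalize(html: str) -> str:
    """Crude normalization: drop tags, lowercase, collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", html)
    return " ".join(text.lower().split())


def likeness(html_a: str, html_b: str) -> float:
    """Score in [0, 1]; 1.0 means identical after normalization."""
    a, b = normalize(html_a), normalize(html_b)
    # Stage 1: equal hashes settle the question without running a diff.
    if hashlib.md5(a.encode("utf-8")).digest() == hashlib.md5(b.encode("utf-8")).digest():
        return 1.0
    # Stage 2: word-level similarity ratio as a stand-in for a udiff.
    return difflib.SequenceMatcher(None, a.split(), b.split()).ratio()


def is_duplicate(html_a: str, html_b: str, threshold: float = SIMILARITY_THRESHOLD) -> bool:
    return likeness(html_a, html_b) >= threshold


amazon = "<p>Amazon.com offers and discounts on widgets, updated daily for all of our shoppers</p>"
buy = "<p>Buy.com offers and discounts on widgets, updated daily for all of our shoppers</p>"
other = "<p>Completely different text about gardening tools</p>"

print(is_duplicate(amazon, buy))    # only the merchant name differs
print(is_duplicate(amazon, other))  # no words in common
```

With these sample strings the first pair scores above the threshold and the second scores zero, which matches the "same page, different merchant name" case discussed earlier in the thread.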