Forum Moderators: Robert Charlton & goodroi
After just one result I got the standard Google message -
In order to show you the most relevant results, we have omitted some entries very similar to the 1 already displayed.
If you like, you can repeat the search with the omitted results included.
I've seen this message before, but I don't recall it blocking different websites. I thought it was used to block the same page/content within a site.
Is this new? It seems ridiculous, considering the sites have nothing in common except that line.
Technically, it may not be duplicate pages, and there is another paper (or patent application) that's more relevant, and deals more with query specific clustering and filtering, but this one gives a good idea of the concept:
Detecting query-specific duplicate documents [patft.uspto.gov]
The idea is that "portions" of pages are being used, what other papers sometimes refer to as "footprints" or "fingerprints."
[edited by: Marcia at 4:32 am (utc) on Dec. 18, 2007]
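For anyone who wants to see the "fingerprint" idea in concrete terms, here is a minimal sketch. It is not Google's method or the patent's algorithm, just the generic word-shingle approach: hash every run of a few consecutive words and look for hashes two pages share. The function names, the shingle size of 4 and the sample pages are all made up for illustration.

```python
import hashlib
import re


def fingerprints(text, size=4):
    """Return a set of short hashes, one per run of `size` consecutive words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < size:
        return set()
    shingles = (" ".join(words[i:i + size]) for i in range(len(words) - size + 1))
    # Truncated MD5 is used only as a compact label here, not for security.
    return {hashlib.md5(s.encode("utf-8")).hexdigest()[:12] for s in shingles}


def shared_fingerprints(page_a, page_b, size=4):
    """Fingerprints that appear on both pages."""
    return fingerprints(page_a, size) & fingerprints(page_b, size)


# Hypothetical pages: unrelated sites that happen to carry one identical sentence.
page_a = ("Shop A weekly deals. This compact camera captures sharp photos "
          "in low light. Free shipping today.")
page_b = ("Forum post about gear: this compact camera captures sharp photos "
          "in low light, says the box.")

common = shared_fingerprints(page_a, page_b)
print(len(common), "shared fingerprints")
```

The two pages have nothing in common except one sentence, yet that sentence alone produces a cluster of matching fingerprints, which is the kind of "portion" match described above.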
That's gonna put a real damper on electronics stores that all have essentially the same item descriptions (content) for many of their pages.
If someone is searching on the words that are in that duplicated copy, then they probably don't need a full selection of websites where it appears - at least not on the first result set. But if their query is more like a brand name and model number, then because the duplicate detection is query-dependent, the filtering of results would also be different and usually much more relaxed.
...a final set of query results, wherein the final set of query results is a sub-set of the ranked query results, and wherein the final set of query results does not include any two query results corresponding to documents that have similar query-relevant parts....
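To make the query-dependent part concrete, here is a rough sketch of the filtering step that excerpt describes: walk the ranked results and drop any result whose query-relevant part is too similar to one already kept. How Google actually picks the "query-relevant parts" is not stated in this thread, so the sentence-based extraction, the Jaccard similarity, the 0.8 threshold and the sample store pages below are all assumptions for illustration.

```python
import re


def words(text):
    """Lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def query_relevant_part(document, query):
    """Assumed definition: the words of every sentence that mentions a query term."""
    terms = set(words(query))
    part = set()
    for sentence in re.split(r"[.!?]+", document):
        sentence_words = set(words(sentence))
        if sentence_words & terms:
            part |= sentence_words
    return part


def similarity(a, b):
    """Jaccard similarity of two word sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def filter_results(ranked_docs, query, threshold=0.8):
    """Keep a result only if its query-relevant part differs from every kept one."""
    kept, kept_parts = [], []
    for name, text in ranked_docs:
        part = query_relevant_part(text, query)
        if all(similarity(part, other) < threshold for other in kept_parts):
            kept.append(name)
            kept_parts.append(part)
    return kept


# Hypothetical stores sharing one manufacturer sentence.
docs = [
    ("store-a", "The XR200 camera captures sharp photos in low light. "
                "Store A ships the XR200 free."),
    ("store-b", "The XR200 camera captures sharp photos in low light. "
                "Store B keeps the XR200 on the shelf."),
]

print(filter_results(docs, "sharp photos in low light"))  # only store-a survives
print(filter_results(docs, "XR200"))                      # both stores survive
```

Searching on words from the duplicated copy makes the two query-relevant parts identical, so the second store is omitted. Searching on the model number pulls in each store's own sentence as well, the parts diverge, and both stay in.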
Google's treatment of dupe content is very definitely query related, and it also seems to relate to how a site might be linked. Inbound links related to a query will mitigate what's seen as duplicate content. (You might say that this, in a way, reflects the off-page portion of Google's ranking algo).
I've seen scraped pages temporarily get filtered out as dupes for, say, a full ten-word sentence in quotes that doesn't contain any "optimized" terms, yet continue to rank as normal for a competitive phrase for which they had good inbound anchor text.
I also continue to see multiple copies (that are pretty much identical) of authoritative articles rank on different domains for competitive searches if both copies have good backlinks.
I'm so tempted to force removal of that post by switching those hotlinked images with something much less savoury.
The interesting thing is, that forum page is not accessible unless someone is registered and logged into that site. There is currently no cache of the page, although there was one showing 2 days ago.
[edited by: ChicagoFan67 at 8:49 am (utc) on Dec. 18, 2007]
It's what I call the "similarity" filter, and it isn't new. It'll block pages with the same page title/meta description/page_top text (and/or, most usually "and"), no matter what site they're on.
I do remember Marcia talking about this before, but I thought it wouldn't come down to sentences. Am I right in saying that famous quotes, presidential speeches, etc. will be included in this filter?
You know there are sites/programs that "track" text to see who, if anyone, lifts content from other sites. Why can't Google create code to do the same thing, or buy the company?
It has cached site pages, so why doesn't it compare the cached text, and stop rewarding thieves with top SERPs!?!
Hint to Google: if one site has content before another, the one that got it later might be the thief.
p/g
Why does Google still find it so difficult to figure out who is the original source of text, and who copied it? Is this really an engineering nightmare?
Using the OP's quote as an example, what if the director of Liar Liar started a website and put that quote on there - how would Google know that website should be considered the original source of the text?
I think not, albeit the filters may be applied in different ways to different situations. I think it should all be called "duplicate content" in "Webmaster language" to avoid confusion.
If a block of text is identical, then it's a duplicate block of content. Google will often filter it as described.
If a page contains the same block of text amongst other differing text, then it may not be considered duplicate, depending on what proportion of the page / URL that block makes up.
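The "proportion" point can be sketched as a simple calculation: what share of a page's words sit in sentences that also appear on another page. This is only one guess at how such a proportion might be measured, not a description of Google's filter; the function names and sample pages are hypothetical.

```python
import re


def sentences(text):
    """Split text into normalised sentences (lowercase, single spaces)."""
    parts = re.split(r"[.!?]+", text.lower())
    return [" ".join(p.split()) for p in parts if p.strip()]


def duplicated_share(page, other):
    """Fraction of the page's words found in sentences that also appear on `other`."""
    other_sentences = set(sentences(other))
    total = shared = 0
    for s in sentences(page):
        n = len(s.split())
        total += n
        if s in other_sentences:
            shared += n
    return shared / total if total else 0.0


# Hypothetical pages built around one copied block of text.
block = "This compact camera captures sharp photos in low light."
source = block + " Official product page."
thin_page = block + " Buy now."
rich_page = (block + " We tested it for a month on three trips."
             " Battery life was the weak point.")

print(round(duplicated_share(thin_page, source), 2))
print(round(duplicated_share(rich_page, source), 2))
```

The thin page is mostly the copied block, while the page with its own review text around the same block scores far lower, which fits the idea that identical text amongst other differing text may not be treated as duplicate.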