Forum Moderators: Robert Charlton & goodroi
My understanding is that this filter is sitewide rather than page by page, and that it can only be removed by top level Google people - do others see this as a problem for many sites?
It's on a page basis, not a domain basis, but if two or more sites carry mostly the same content, then often you'll see that the more powerful site wins across many searches.
<snipped an irrelevant point>
I mean the site that tends to have more powerful pages. ;-)
I don't know if that's the URL with higher PR, or the URL that is listed higher without the near-dupe filter.
> How do you define "wins"
The page that is listed even without clicking &filter=0
Then they realise the filter brought down the scrapers but also damaged legit sites, so they dial the filter back a little, and the scrapers start floating to the top again.
The supplemental portion of the algo is probably still buggy, and I see some authority websites getting a lot of supplemental pages.
I mean the site that tends to have more powerful pages. ;-)
I don't know if that's the URL with higher PR, or the URL that is listed higher without the near-dupe filter.
I can absolutely confirm that it is not the higher PR pages that are winning. My personal theory is that the pages being displayed are getting there by hitting the semantically related kw mix best, and that there are also other sitewide factors now clearly coming into play.
I have a feeling this removal is not automatic but manual - any thoughts on that?
Though it may not seem so on first glance, there is a related thread here [webmasterworld.com] in the Supporters Forum. (If you haven't ponied up to be a Supporter yet, and you're dealing with this issue, now would be a good time.) ;-)
I have no clue how you would know what 'most sites' do unless you maintain a hecka large db that's kept very fresh, but in any event this is not the case where we're looking.
FWIW, my guess is that it won't stay this way for too long, simply because whatever dials have been turned are resulting in some pretty high quality pages being dropped in favor of pages that are much lower in overall importance but are newer, and/or have guessed right or figured out other key elements of the algo that (IMO) relate to kw's.
Insightful stuff here and in supporter's forum. I think you are on to something with the idea that they have turned up the restrictions beyond what is reasonable in an effort to knock out a higher percentage of scrapers.
I'd suggest it may even be calculated as a way to "purge" the scrapers and assume the legitimate sites will stick it out where the scrapers will bail out quickly, but that's wild speculation on my part.
I can absolutely confirm that it is not the higher PR pages that are winning. My personal theory is that the pages being displayed are getting there by hitting the semantically related kw mix best, and that there are also other sitewide factors now clearly coming into play.
Could your experience be consistent with Google keeping whichever page would score highest for a search without the near-dupe filter?
I think I see an easy way to check for that. Find two pages where one only reappears with &filter=0, pick searches that would favour one or the other, and see if the remaining URL (with the filter in place) switches according to the search.
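Once you've collected the result lists by hand, the comparison itself is simple. Here's a small sketch of the scoring step (the URLs and rankings below are invented for illustration; how you gather the SERPs is up to you):

```python
def kept_top_scorer(unfiltered, filtered, pair):
    """True if the near-dupe URL Google kept in the filtered SERP is also
    the higher-ranked of the pair in the unfiltered (&filter=0) SERP.

    `unfiltered`/`filtered` are ranked lists of URLs; `pair` is the two
    near-duplicate URLs under test.
    """
    a, b = pair
    kept = a if a in filtered else b
    better = a if unfiltered.index(a) < unfiltered.index(b) else b
    return kept == better

# Invented data: page A outranks page B without the filter, and A is
# also the one that survives filtering -- consistent with the theory.
unfiltered = ["site1.com/a", "other.com/x", "site2.com/b"]
filtered = ["site1.com/a", "other.com/x"]
print(kept_top_scorer(unfiltered, filtered, ("site1.com/a", "site2.com/b")))  # True
```

If that comes back False for searches biased toward either page, the "keep whichever would score highest" explanation doesn't hold for that pair.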
Will semantics come into play before the CRC, or the fingerprinting of the web content? Say similar terms like price, value, cost, worth (I know this is a bad example): are they the same, or considered close enough to count as duplicate content in a page?
Will Google decide on a default and apply the filter upon rating the full content of the page for similarity, before ranking?
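To make the CRC-versus-semantics question concrete, here's a toy comparison (my own sketch, not anything we know about Google's method): an exact CRC32 fingerprint breaks on a single synonym swap, while word shingles still show heavy overlap between the two near-duplicate pages.

```python
import zlib

def crc_fingerprint(text):
    # Exact checksum of the normalized text: swap one word
    # (price -> cost) and the whole value changes.
    return zlib.crc32(" ".join(text.lower().split()).encode())

def shingles(text, k=3):
    # All overlapping k-word sequences in the text.
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    # Overlap between the two pages' shingle sets, 0.0 to 1.0.
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)

page1 = "the best price for fluffy yellow widgets with pink trim"
page2 = "the best cost for fluffy yellow widgets with pink trim"

print(crc_fingerprint(page1) == crc_fingerprint(page2))  # False
print(round(jaccard(page1, page2), 2))                   # 0.45
```

So a pure checksum would call these two pages unrelated, while a shingle-overlap measure flags them as near-dupes even with the synonym swap; whether the swapped word is a *semantic* match (price vs. cost) never enters into either calculation.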
There are pages on my site that are obviously exact terms found elsewhere but don't get filtered, yet there are others that have only a small portion of duplicate content and go into supplemental results. There are pages with no duplicate content that go into supplemental results too.
Other than buggy, it could be google has 'shorted' itself out. Just ban the scrapers! hehe.
...and that there are also other sitewide factors now clearly coming into play.
Yes, I think that the 'other sitewide factors' are very much utilized in the new SERP's too.
Unfortunately, Google has become much more elitist in nature, favoring 'authorities' in any field.
E.g. Bush mentioning 'spirit' would make him an expert in spiritual matters and philosophy @Google.
Anyway, IMO PR is still very important for a total sitewide score.
The thing is, the site's 'reputation' (according to Google, that is) comes first, before the relevancy of the page. Thus even if your page is non-duplicate content, you will still rank lower, much lower, than pages on sites that (again according to Google) have little or no duplicate content site-wide.
CIML, as is so often the case when dup issues come up, I think there are multiple sorts of issues being discussed in this thread, partly my fault. If joeduck was referring only to pages that reappear when the &filter=0 is used, then I've been OT part of the time, because I'm actually exploring what appear to be two similar but not necessarily related examples, only one of which brings all pages up with the &filter=0 search.
Case One
This is actually a variety of cases, but all behaving similarly. The &filter=0 always adds the missing page back, sometimes under the winner, sometimes above the winner. But indented results are factoring in here too, and the differences between result sets are subtle. So much so that it is hard to work out what the determining factor(s) are.
But actually, I'm wondering if a lot of what people are discussing re dup content in various threads right now involves searches where the &filter=0 does not bring the missing pages back, as in the following:
Case Two
The filter does not help the vanished pages show. The vanished pages had previously been top performing, with TBPR of 6 (homepage TBPR 7). Now, similar (competitive) pages are in the place of our vanished pages. The competitive pages have TBPR of 4. Notably, the competitive pages stole part of what appears on our pages, though the competitive pages are much newer and show far lower TBPR.
I don't think that this is a hand check, partly for reasons I can't discuss, and partly because our vanished pages are more useful than the newer competitive pages showing in their place.
I don't know what the latest thinking is on &filter=0 but IMO, this only shows 'acceptable, similar, related' pages that were not shown on the filtered search. It's still very possible to be filtered for duplication of various kinds and not show up with the &filter=0 in place.
Marcia has an interesting theory that she posted in the Supporters thread I referenced in msg#11 above. If I understand her correctly, she's wondering if perhaps some dup filtering is occurring at a different time than before, possibly in the BlockRank phase, if they're even using that (I'd guess they are). The more I think about her theory and the way she connects it to the original BackRub paper, the more convinced I am that it would explain what I'm seeing here. But as I noted over there, if they're doing this, it seems to be a very blunt instrument approach to an issue that rests on subtleties (near duplicated pages, or stolen snippets of content).
This might account for some of the many recent complaints about quality subpages vanishing in the Bourbon update.
In one case, MY site was in the "similar results" list; in another it was way, way down in the listings. I don't know what criteria they use, but PR, inbound links from external sites, and the age of the text don't seem to be among them. The date-time stamp of the page, maybe; I've wondered about that, but rewarding webmasters for not keeping their sites up to date would be a curious way to run a search engine.
1. Spam with lots of duplicate pages.
2. Repetition, e.g. Google's 1460 pages about "Googol, Milton, etc."
[search.yahoo.com...]
3. Specialist niche site with a page about
"fluffy yellow widget pink trim",
"fluffy yellow widget yellow trim",
"fluffy yellow widget blue trim",
and so on.
So it tells you nothing about quality, relevance, or anything useful.
It's funny hearing engineers cover for Google with complicated discussions, but to paraphrase Google's own quality guides, "Another test would be 'would I feel comfortable explaining this to other search engines'?"
Here, search for "google googol milton":
[google.com...]
The first entry from Google.com comes in at 56, and it's a FRENCH page, and I'm searching in ENGLISH. Google has determined that the best page to present to me from its own site is the French one, and that Chinese pages (position 29) should outrank that.
Now let's run the test on other engines:
[search.yahoo.com...]
[search.msn.com...]
[walhello.info...]
[kanoodle.com...]
Google's site is top. Only Vroosh lets us down by showing it 2nd.
Clearly they have made a mess of it themselves, and the only people who can clean up the mess are Google.
If joeduck was referring only to pages that reappear when the &filter=0 is used, then I've been OT part of the time, because I'm actually exploring what appear to be two similar but not necessarily related examples, only one of which brings all pages up with the &filter=0 search.
You are totally on topic as far as I'm concerned Caveman. I've noted both the cases you describe, but assumed (wrongly/simplistically?) that they are related to a sitewide downgrade that somehow devalues all results from the unlucky site.
Some of those devalued pages make it into "omitted results" and some do not. I'm not sure I'm understanding Marcia's observations but will review her posts here and in supporters. Google is damaging my brain again.
The dup issue as it relates to site-wide issues within a site isn't the same thing as dup issues with external or scraped pages, particularly when it comes to IDF and/or site-wide anchor text.
Marcia - it certainly makes sense that the two would be treated differently. Do you think &filter=0 removes both types? Are there any other filtering terms?
My site uses a navigational menu, but the choices in that menu are the keywords that are also in the titles of my pages. Is it possible that that kind of optimized menu could be causing an over-optimization penalty or a dup content penalty (always using the same anchor text)?
So what is the best way to make a site-wide menu still optimized, but that does not feel spammy to Google?
Thanks
The newer dupe factor doesn't remove URLs, but causes the rankings to be lower. This appears to apply to URLs across all searches, and is therefore believed to be applied at index time (or periodically). It helps to keep all Google result sets relatively free of high ranking feed pages.
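That "demote rather than remove" description could be sketched like this (a toy model with an invented demotion factor, not Google's actual pipeline): URLs sharing a content fingerprint form a cluster, the strongest member keeps its score, and the rest take a fixed demotion that then applies uniformly across every search.

```python
DEMOTION = 0.5  # invented factor for illustration

def index_time_demotion(pages):
    """pages: {url: (fingerprint, base_score)} -> {url: effective_score}.

    Applied once at index time, so the demotion follows the URL into
    every result set rather than being recomputed per query.
    """
    # Pick the strongest page in each fingerprint cluster as canonical.
    best = {}
    for url, (fp, score) in pages.items():
        if fp not in best or score > pages[best[fp]][1]:
            best[fp] = url
    # Everyone else in the cluster stays indexed, but ranks lower.
    return {
        url: score if best[fp] == url else score * DEMOTION
        for url, (fp, score) in pages.items()
    }

pages = {
    "original.com/page": ("abc123", 0.9),
    "scraper.net/copy":  ("abc123", 0.6),   # same fingerprint, weaker page
    "unique.org/post":   ("def456", 0.7),
}
scores = index_time_demotion(pages)
print(scores["scraper.net/copy"])  # 0.3: still indexed, just ranked lower
```

The point of the sketch is the behaviour ciml describes: the scraped copy never disappears from the index, it just loses across all searches at once.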
<<Is it possible that that kind of optimized menu could be causing a over optimization penalty or a dup content penalty ( using always the same anchor text?>>
No problem, patchacoutek.
I wouldn't be so sure. It could be that your experience tells otherwise, but I agree with the following:
" Synthetically generated web graphs, which are usually indicative of an intent to spam, are based on coordinated decisions, causing the profile of growth in anchor words/bigrams/phrases to likely be relatively spiky.
One reason for such spikiness may be the addition of a large number of identical anchors from many documents. "