The issue is that even within a single site, when pages are deemed too similar, G is not just throwing out the dups - they're throwing out ALL the similar pages.
The result of this miscalculation is that high quality pages from leading/authoritative sites, some that also act as hubs, are lost in the SERP's. In most cases, these pages are not actually penalized or pushed into the Supplemental index. They are simply dampened so badly that they no longer appear anywhere in the SERP's.
The current problem is actually not new IMHO. It began surfacing on or about Dec 15 or 16 of last year. At that time, the best page for the query simply seemed to take a 5-10 spot drop in the SERP's...enough to kill most traffic to the page, but at least the page was still in the SERP's. If there were previously indented listings, those were dropped way down.
From early Feb through about mid March, the situation was corrected and the best pages for specific queries were again elevated to higher rankings. When indented listings were involved however, the indented listing seemed now to be less relevant than was the case pre-Dec.
In mid March to about mid May, the situation worsened again, approximately to the problems witnessed in mid Dec., i.e., the most relevant pages dropped 5-10 spots, indents vanished as was the case in Dec.
But the most serious aspect of the problem began in mid May, when G started dropping even the best page for the query out of the visible SERP's.
A few days ago, the problem worsened, going deeper into the ranks of high quality, authoritative sites. This added fuel to what has become the longest non-update thread [webmasterworld.com] I've ever seen.
Why This is Such a Problem
The short answer is, that a lot of very useful, relevant pages, are now not being featured. I'm not talking about just downgraded. They're nowhere.
Now, I'm sure that there are sites that deserved the loss of these vanished pages. But there are plenty of others whose absence is simply hurting the SERP's. There is a difference between indexing the world's information, and making it available after all.
We help a client with a scientific site about insects (not really, but the example is highly analogous). Let's discuss this hypothetical site's hypothetical section about bees. Bees are after all very useful little creatures. :-)
There are many types of bees. And then there are regional differences in those types of bees, and different kinds of bees within each type and regional variation (worker, queen, etc). Now, if you research bees, and want to search on a certain type of bee - and in particular a worker bee from the species that does its work in a certain region of the world, then you'd like to find the page on that specific bee.
Well, you used to be able to find that page, near the top of the SERP's, when searching for it.
Then in mid Dec, you could find it, but only somewhere in the lower part of the top 20 results.
Now, G is not showing any pages on bees from that site. Ergghh.
What is an Affected Site To Do?
One option, presumably, would be to stop allowing the robots to index the lesser pages that are 'causing' the SE's to drop ALL the related pages. But this is a disservice to the user, especially in an era when GG has gone on record as taking pride in delivering especially relevant results, and especially for longer tail terms.
Should we noindex all the bee subpages, so that at least searchers can find SOME page on bees from this site? (I'm assuming that noindexing or nofollowing the 'dup' pages that are not really 'dup' pages at all would nonetheless free the one remaining page on the topic to resurface; perhaps a bad assumption.)
In any case, I refuse. Talk about rigging sites simply for the purpose of ranking. That's exactly what we're NOT supposed to be doing.
G needs to sort this out. ;-)
Note: Posters, please limit comments to the specific issues outlined in this thread. There are a lot of dup issues out there right now. This is just one of them.
FWIW, IMO, the ongoing presence of the filtered pages in the main index IS the distinguishing factor for this particular kind of internal site dup filtering, which has nothing to do with external pages from other sites.
My guess, and this is only wild speculation, is that G is filtering the kind of pages I've alluded to without any malice or permanent negative consequences. Possibly because they are concerned that the pages filtered in this particular circumstance are in fact decent pages, that they're just choosing for now to exclude from the SERP's.
Brett pointed out in one thread long ago that dup filters typically (at that time) were not the same as penalties. They simply dampened a page's ranking to the point where the effect was comparable to a penalty.
If the filtered pages in this particular situation start going supplemental, then I'll be a lot more upset. That would mark the point where we go from sorting out the issue, to baby with the bathwater time.
MatthewHSE, G IMHO has been employing various forms of internal, sitewide assessment and filtering for a long while now, so the concept is nothing new. This issue, if I have it right (and I may not), is little more than the turning of a dial.
As for going after Joe Beekeeper, I don't believe for one moment that that is G's intent. They're going after spam. The issue I raise, IF it's accurate, is collateral damage.
A couple more criticisms ... it's Monday ...
They have definitely coded themselves into a corner. A shame..
Also, why would you spend investor dollars to enable free wi-fi for all of San Fran? I see this as gimmicky, frivolous spending .. (no return other than pr)
I would at least like to SEE my site in their SERPs in the top 1000. I don't care if it's not on page 1-2-3 or 5 for that matter. Just show that you acknowledge it for certain keywords other than my own domain name.
That would make more sense and also send the message that their code is not in serious trouble like it is.
Beyond search .. where can this company compete? And they are killing their own core competencies.
Time will tell. I think the party with Wall Street may be tapering off as well. Watch and see ...
I would be interested to know if deep links from outside sites help protect from this issue.
I would have to think that if Google finds a page through an outside link then it would have no choice but to list it. That's why I have hundreds of pages indexed that are almost exactly alike with the exception of a few words and query string c= parameters (outside sites have tracking values to monitor production and traffic).
Also, regarding templates. I have noticed that since most template pages use the same title and description, this can be circumvented by using product names as values at the end of a URL to be dynamically added into the meta tags, thereby producing a "unique" page as far as G is concerned.
the resulting title would start as:
Purchase (insert) at Widget Town Today!
and end as:
Purchase Thick Blue Widget at Widget Town Today!
I have seen this work in establishing enough individuality for G to consider this an autonomous page. Of course the page content itself must also be unique enough but I've found that as long as there are 5 or more sections of the page with differing text (only a sentence or so in each section) then this will do. You might want to do something that automatically generates a description tag this way as well. Make a note that I've only studied this method on authority sites with high rankings.
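For anyone who wants to see the title trick spelled out, here is a minimal sketch in Python. It assumes the product name arrives in a query-string parameter called "product" (a made-up name), uses example.com as a stand-in domain, and invents a description template to go with the title template quoted above. It only shows the mechanics; whether the result is "unique enough" for G is a separate question.

```python
from html import escape
from urllib.parse import parse_qs, urlparse

TITLE_TEMPLATE = "Purchase {product} at Widget Town Today!"
# Hypothetical description template, not taken from any real site.
DESCRIPTION_TEMPLATE = "Order the {product} online from Widget Town."


def product_from_url(url, param="product"):
    """Return the product name carried in the query string, or None.

    The parameter name "product" is an assumption for this sketch.
    """
    values = parse_qs(urlparse(url).query).get(param)
    if not values:
        return None
    # Collapse stray whitespace so the title reads cleanly.
    name = " ".join(values[0].split())
    return name or None


def meta_tags(url, fallback="Widgets"):
    """Build the <title> and description tags for a template page."""
    product = product_from_url(url) or fallback
    # Escape the value: it comes straight from the URL and is untrusted.
    safe = escape(product, quote=True)
    title = TITLE_TEMPLATE.format(product=safe)
    description = DESCRIPTION_TEMPLATE.format(product=safe)
    return (
        f"<title>{title}</title>\n"
        f'<meta name="description" content="{description}">'
    )


if __name__ == "__main__":
    print(meta_tags("http://www.example.com/item?product=Thick+Blue+Widget"))
```

Escaping the value matters here, since anything pasted onto the end of the URL ends up in your markup.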
So now if we webmasters post more than one page on a topic like 'bees', then Google thinks we are spamming.
That might be true if the pages consisted mostly of duplicate content, but it certainly isn't a true statement in general.
In the "bee" example, Google might think (not without reason) that pages were duplicate content, and that the intent was to spam the index, if all five pages consisted of virtually the same text with just a few words changed here and there. E.g.:
Peruvian honeybee: A black-and-yellow bee with a pointy stinger and hairy feet that buzzes in Spanish, generic text generic text generic text...
Brazilian honeybee: A black-and-yellow bee with a pointy stinger and hairy feet that buzzes in Portuguese, generic text generic text generic text...
Irish honeybee: A black-and-yellow bee with a pointy stinger and hairy feet that buzzes in Gaelic, generic text generic text generic text...
Now, there might be legitimate reasons for using this kind of duplicate content, but Google can hardly be faulted for assuming that such patterns are artificial. And if there's a 97% statistical likelihood that such blatant duplication is the result of aggressive SEO and merely clutters the index with boilerplate content, then is it so unreasonable for Google to simply filter all the pages and rely on reinclusion requests to correct the few instances where the duplicate content might be legitimate and in keeping with Google's stated mission? In situations like the bee examples above, why shouldn't the burden be on the publisher to demonstrate that the duplicate content is legitimate and of value to users?
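To put a rough number on "virtually the same text", here is a small sketch using word shingles and Jaccard similarity. This is a textbook near-duplicate measure chosen for illustration; nothing in this thread confirms it is what Google uses, and the shingle size of 3 is an arbitrary assumption.

```python
def shingles(text, size=3):
    """Return the set of overlapping word n-grams ("shingles") in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(text_a, text_b, size=3):
    """Jaccard similarity of two texts' shingle sets (0.0 to 1.0)."""
    a, b = shingles(text_a, size), shingles(text_b, size)
    if not a and not b:
        return 1.0  # two texts too short to shingle count as identical
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    peruvian = ("Peruvian honeybee: A black-and-yellow bee with a pointy "
                "stinger and hairy feet that buzzes in Spanish")
    brazilian = ("Brazilian honeybee: A black-and-yellow bee with a pointy "
                 "stinger and hairy feet that buzzes in Portuguese")
    print(jaccard(peruvian, brazilian))
```

On the two sample sentences above (without the trailing generic text) this comes out at 0.75, and it would climb toward 1.0 as more shared boilerplate is appended. A filter keyed to a score like this could not tell a spun page from a legitimately similar one, which is the crux of the argument.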
... is it so unreasonable for Google to simply filter all the pages and rely on reinclusion requests to correct the few instances where the duplicate content might be legitimate and in keeping with Google's stated mission?
No, it's not unreasonable for them to filter out most of the pages, if it helps them battle spam. OTOH, it's bad for the user and the SERP's to filter out all of the relevant pages.
... few instances ...
In situations like the bee examples above, why shouldn't the burden be on the publisher to demonstrate that the duplicate content is legitimate and of value to users?
I did not say that fighting spam was easy. But if G is to retain its standing then they must find ways to control spam without accepting ever growing levels of collateral damage. Sort of obvious on its face, I would think.
Peruvian honeybee / Brazilian honeybee / Irish honeybee
Think more in terms of "Peruvian honeybee - Queen" / "Peruvian honeybee - Drone" / etc.
The appearance and behaviors are quite different. Just the page structures are the same.
[edited by: caveman at 6:17 pm (utc) on Oct. 3, 2005]
So what's the problem with a session id, and why doesn't Googlebot crawl them? Well, we don't just have one machine for crawling. Instead, there are lots of bot machines fetching pages in parallel. For a really large site, it's easily possible to have many different machines at Google fetch a page from that site. The problem is that the web server would serve up a different session-id to each machine! That means that you'd get the exact same page multiple times--only the url would be different. It's things like that which keep some search engines from crawling dynamic pages, and especially pages with session-ids.
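As a rough illustration of the session-id problem described above, the sketch below collapses URLs that differ only in a session parameter. The parameter names in SESSION_PARAMS are guesses at common conventions, and example.com is a placeholder; a real site would list whatever its own software appends.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed names; check what your own server software actually uses.
SESSION_PARAMS = {"sid", "sessionid", "phpsessid", "jsessionid"}


def strip_session_id(url):
    """Return the URL with any session-id query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


if __name__ == "__main__":
    # Two bot machines handed two different session ids for the same page.
    first = "http://www.example.com/page.php?id=7&sid=abc123"
    second = "http://www.example.com/page.php?id=7&sid=zzz999"
    print(strip_session_id(first) == strip_session_id(second))
```

The same idea applies to tracking values like the c= parameter mentioned earlier in the thread, if those are not meant to produce distinct pages.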
Also re how much original content needs to be applied to get past the Dup Content penalty--it's around 12%. Just write an intro paragraph above each article equalling over 12% and be sure and make the title and description apply to that article and change that a bit also just to be safe.
I have a whole website full of articles and don't have a Dup penalty on any of them. They are all static pages and all original content and even though they get copied at times I pursue those culprits like a crazed pit bull with a bee attached to his rear and get their content removed if not their whole website. It's a lot of work to chase them down but if you don't do it you lose.
One more thing in advance - don't forget to take Google Sitemaps into the equation. I believe it's one of the major reasons for sites to take a duplicate content hit at the moment. Just think - if you don't submit the URLs exactly the way Google has them listed (parameters, case, parameter-order, etc.) then Google will regard them as separate URLs. If it found these "in the wild", it would just disregard them. But since you're feeding them to Google, it will be forced to take a look, partially indexing them, noticing they're duplicate (in regards to already indexed ones), and pushing them back out. The problem with Google-Sitemaps that is different to "in the wild" links is that Google queries the sitemaps-files 2x/day. So you're really submitting those "bad" URLs 2x/day, Google has no chance to get past the partially-indexing/notice-it's-duplicate part, it keeps getting pushed in and out... So if your sitemap file isn't the same way Google has indexed your existing site, then you WILL run into this problem. If the site is new then it can also be an issue - Google will still run into it when crawling normally. "Cloaking" parameters (making sure the SEs get the right set) and redirecting when not on the SE-optimal-URL is a way, but very difficult and who knows how Google might recognize that as BH-cloaking and ban you completely...
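If you want to check a sitemap against the URLs already indexed, something along these lines would flag entries that differ only in case or parameter order. It is a sketch under assumptions: the function names are made up, example.com is a placeholder, and lower-casing the path is a simplification for comparison only (many servers treat paths as case-sensitive).

```python
from urllib.parse import parse_qsl, urlsplit


def url_key(url):
    """Comparison key that ignores case and query-parameter order.

    The key is for matching only; it is not a URL to submit anywhere.
    """
    parts = urlsplit(url)
    params = sorted(
        (key.lower(), value.lower())
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
    )
    return (parts.netloc.lower(), parts.path.lower(), tuple(params))


def sitemap_mismatches(indexed_urls, sitemap_urls):
    """Return (sitemap_url, indexed_url) pairs that match loosely but not exactly."""
    indexed_by_key = {url_key(url): url for url in indexed_urls}
    mismatches = []
    for url in sitemap_urls:
        indexed = indexed_by_key.get(url_key(url))
        if indexed is not None and indexed != url:
            mismatches.append((url, indexed))
    return mismatches


if __name__ == "__main__":
    indexed = ["http://www.example.com/widgets.php?cat=2&color=blue"]
    sitemap = [
        "http://www.example.com/Widgets.php?color=blue&cat=2",  # case + order differ
        "http://www.example.com/widgets.php?cat=2&color=blue",  # exact match
    ]
    for submitted, known in sitemap_mismatches(indexed, sitemap):
        print(f"{submitted} should probably be {known}")
```

Each flagged pair is a candidate for rewriting the sitemap entry to the already-indexed form rather than submitting a second spelling of the same page.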
But let's look at it differently -- if you ran a search engine, how would you tackle the duplicate-content issue (for pages which aren't 100% duplicates)? Without going into semantics and anything exotic (won't work on those Swahili-Spammer-Sites anyway), I'd try the following approach:
Split the pages into distinct sections, use the HTML-block-level elements, say P, DIV, TD, Hx, etc. to help you. Doing that, you'll find elements which are general to the site (menu bars, headers, footers, sidebars). Comparing those blocks to other blocks in the same site is pretty easy. What's left are things that change from page to page.
Now we're in the content-level, the meat and bones of the page. We can still dissect it more: do we have any small/short sections that seem to be of no use? Dates, numbers, simple links without any textual information? Throw them out. They can't be relevant if no Text is around them.
We can now safely parse the Hx information, store that with the keywords, etc. for later.
The rest: Compare the blocks to other blocks on other known sites. Do they match? Perhaps they were copied? Perhaps this is an affiliate site? Perhaps they put "$$" on the pages in random places? (oops, forget that one.) Throw everything out that matches blocks from other sites.
What's left? specific content for the page. If nothing is left, throw the page away, the webmaster isn't publishing anything new, bye bye, serp. Perhaps the page with the original blocks (first found) or the one with the highest PR can be kept (Amazon itself perhaps)?
So what else do we kill like this: affiliate sites with standardized content on customized templates, RSS-scrapers (good, no?), DMOZ scrapers (ditto), site-scrapers (ditto, unless it hits your original site... ouch! then contact Google), people copy+pasting partial content (in full block-level blocks) from other sites, etc.
IMHO, that's the way I would go at it. No need for complicated semantics, no need for "content recognition", synonym detection, etc. These things don't scale very well to other languages, it would be a waste of time for Google to go into them. Sure it's interesting, but there is no plausible way Google could reasonably do it (at the moment). Block level content detection, say based on a simple 512bit-hash of the content, would be really easy to do and could go a long way.
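The steps above can be sketched in a few lines of Python. This is only a toy version of the idea as described in this post, not a claim about how Google does it. Assumptions: SHA-512 stands in for the "512bit-hash", the block-tag list and the 4-word minimum are arbitrary, nested blocks are simply cut at every block tag boundary, and "known pages" is just a list of HTML strings.

```python
import hashlib
from html.parser import HTMLParser

BLOCK_TAGS = {"p", "div", "td", "li", "h1", "h2", "h3", "h4", "h5", "h6"}
MIN_WORDS = 4  # assumed cut-off for "short sections of no use"


class BlockExtractor(HTMLParser):
    """Collect the text of each block-level chunk of a page."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buffer = []

    def flush(self):
        text = " ".join("".join(self._buffer).split())
        if text:
            self.blocks.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush()  # a new block starts: close the previous chunk

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        self._buffer.append(data)


def block_hashes(html):
    """Map SHA-512 digest -> block text, skipping very short blocks."""
    parser = BlockExtractor()
    parser.feed(html)
    parser.close()
    parser.flush()
    hashes = {}
    for text in parser.blocks:
        if len(text.split()) < MIN_WORDS:
            continue  # dates, lone numbers, bare links
        digest = hashlib.sha512(text.lower().encode("utf-8")).hexdigest()
        hashes[digest] = text
    return hashes


def unique_blocks(page_html, known_pages_html):
    """Return the blocks of a page that appear on none of the known pages."""
    seen = set()
    for other in known_pages_html:
        seen.update(block_hashes(other))
    return [text for digest, text in block_hashes(page_html).items()
            if digest not in seen]
```

If unique_blocks() comes back empty for a page, that is the "nothing is left, throw the page away" case. Note that an exact hash only catches blocks copied whole; change one word in a block and it counts as new.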
That would also explain why my test-sites are still up and indexed - they contain duplicate content on a sentence level, but not on an HTML-block level. I must admit, I have no idea how it would relate to bees, though.
Note: I'm just a simple mind at work, can't compare to the great minds at Google :-)
All fair comment, but how do you do that?
If you sell 1.0mm widgets, 1.2mm widgets and 1.6mm widgets, widgets which are otherwise identical, how do you prove to Google that this is legitimate?
Also, more generally, let's all try to be careful to stay on topic. The issue outlined in this thread has nothing to do with external sites.
You see my site is a software site and since it's relatively new I get most of the programs' definitions from cnet or softpedia. Now is this also seen by google as duplicate content? If so a lot of sites will lose their pages in google too since all software have only one way to define them :)
Anyways it's just a site I started and it wasn't doing that good anyways, but still this dropping of sites should be done in a more sophisticated way I think.
You see my site is a software site and since it's relatively new I get most of the programs' definitions from cnet or softpedia. Now is this also seen by google as duplicate content? If so a lot of sites will lose their pages in google too since all software have only one way to define them :)
I think this will definitely be seen as duplicate content. The appearance of scraper sites is not helping your cause. You probably would have been fine doing this 2 years ago.
"G needs to sort this out."
Exactly, basically all the recent google tweaks have resulted in my needing to do an unprecedented amount of google specific tweaks just so it can handle the data I give it, whether it's www rewrites, index.htm -> / rewrites, full explicit moved page 301s rewrites, to avoid possible dupe penalties, to more stuff that's too boring to talk about. But the overall effect is that I am having to consider google's requirements now before I even start recoding a site, and install them from the beginning. This has nothing to do with my information, it's a direct result of me having to organize my information for google. In other words, google is no longer able to 'organize the world's information', it needs me to do its work for it.
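For readers who haven't had to do these rewrites yet, here is roughly what the www and index.htm -> / rules boil down to, written as a plain Python function rather than server config. The host names and the list of index filenames are placeholders I've assumed for the sketch; in practice the 301 would be issued by the web server or the application framework.

```python
from urllib.parse import urlsplit, urlunsplit

BARE_HOST = "example.com"           # placeholder domain
CANONICAL_HOST = "www.example.com"  # the one form you want indexed
INDEX_FILES = ("index.htm", "index.html", "index.php")  # assumed names


def canonical_redirect(url):
    """Return the URL to 301-redirect to, or None if already canonical."""
    parts = urlsplit(url)

    # www rewrite: fold the bare domain into the www host.
    host = parts.netloc.lower()
    if host == BARE_HOST:
        host = CANONICAL_HOST

    # index rewrite: /dir/index.htm and /dir/ should be one URL, not two.
    path = parts.path or "/"
    directory, _, filename = path.rpartition("/")
    if filename.lower() in INDEX_FILES:
        path = directory + "/"

    target = urlunsplit((parts.scheme, host, path, parts.query, parts.fragment))
    return None if target == url else target


if __name__ == "__main__":
    for url in ("http://example.com/index.htm",
                "http://www.example.com/bees/index.html",
                "http://www.example.com/bees/"):
        print(url, "->", canonical_redirect(url))
```

The point of returning None is to avoid redirect loops: a URL that is already in the canonical form must be served, not redirected again.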
Google Site maps are an especially obvious example of this failure.
I can accept this, but it's actually getting ridiculous how much I have to do to make sure my content does not trip some filter or other. This is a failure as far as I'm concerned, and on a fairly deep level.
I am not talking about spam here, I'm talking about making sure google doesn't think I'm presenting dupe content when I'm not, things like rewriting all index pages to / and so on, since google by itself seems to be requesting pages that are not even linked to, things like testing /
But here's one very recent thing I saw on a site, it's properly search engine friendly, and google had the pages indexed at roughly 2x the actual total for a year, but recently it decided to roughly increase the total page count to 50x the actual total. Literally. The only way this could have happened is that if it's including each and every link that is blocked by robots.txt in the site total page count. I've seen this behavior on a few different sites now, it's very recent, a few weeks at most.
Interesting observations caveman, this explains an oddity I've been watching for about 4 months, I did a small site, but expanded it without completely filling out the new content pages (too lazy, figured it's easier to create the pages all at once and put filler on them then pad them out later than to hold them all back and add them in later), google has steadfastly refused to spider anything but the major index pages of the site, no penalty or anything, but just won't run through them. Not a big deal in this case, small site, small client, but interesting anyway as a case study.
Your observations also make me wonder about a roughly 5-8 place drop we saw for a single keyword on one collection of sites, we've been trying to figure out the cause, but nothing stands out since sites rank exactly as before for all other major keywords.
I have to wonder though, not sure it's related.
Anyway, on sites where I've written all the content, I'm not seeing any such dupe content problems, obviously. But on other sites, I have to wonder, there's a lot of pages, and I haven't really read them to see how repetitive they are, or if we've accidentally reused articles etc, it's quite possible.
[edited by: 2by4 at 8:25 pm (utc) on Oct. 3, 2005]
Please confine comments and opinions to the prospect of many if not all similar pages within a given site being banned because they are seen as too similar to one another.
The issue outlined in the opening post of this thread has nothing to do with external factors.
2by4, yes, that's the sort of thing I see. Quite a number of unique examples too. We can find no other plausible explanation for it. It's as if G decided any site that has a certain configuration of too-similar-pages is spamming, and all of those too-similar-pages are filtered out. In most of the cases I'm looking at however, those filtered pages are still getting crawled. Whether that lasts is anyone's guess.
Phil_AM, I cannot say with certainty that there IS a filter. I can only describe what I see and what I believe is going on. Even if I'm way off about how it's happening or why, the results remain (i.e., in some cases all similar pages within a site are being filtered, not just some of them).
My personal belief is that it is based on both page structure and text. Not one or the other. For all I know, nav and/or kw analysis may have something to do with it also.
Which means that we aren't looking at the question right as far as I'm concerned. So turn the question around until you can find a plausible explanation that can actually handle the new phenomena you see. Standard scientific method, if your astronomical observations get good enough to make the idea that the sun rotates around the earth infeasible, dump that theory.
<theory>When I started seeing certain types of errors and changes earlier this year, things we just hadn't seen before, I started realizing that there is in fact a possible pattern, especially if you look at certain very interesting threads from 1 and 2 years ago.
I am guessing that google has in fact created a new algo. I started suspecting this because certain of the so called 'tweaks' I'm seeing, and certain 'errors' in how google is requesting pages and urls, have a shared quality, which doesn't become obvious until I stepped back a little. When I say 'obvious', I don't mean this is a fact I can prove, it's just a suspicion.
To me these errors all share a raw, almost beta quality, and do not seem like an upgrade, but tweaks to get a new program tuned.
<added>the timing is especially interesting to me, as anyone who's done reasonably complex programming knows, there's no way to actually test a new system until you start feeding realworld data through it, and you do that at the lowest traffic times, ie the slower summer season. If you think of the algo as a big box, with a bunch of switches on it, it looks to me like this summer google engineers have been flipping certain things on and sometimes back off to study how it functions with real data flowing through it.
The 25x increase in total pages I noted [2x25=50] would be an example of one such switch being flipped.
All major software systems tend to get rewritten every five years or so, it's just not possible to write something so perfectly that you take everything into account. That applies to oses, big apps, little apps, and google's stuff.
Would certainly account for a lot and go with my feeling that over the past year they have been fixing things that weren't broke - problems in a totally new algo would account for that, reintroducing problems the original algo solved years ago.
Problem is, supposed offseason or not, I find it pretty cold that - unless the Googleplex is located on Mars - they have to know these little experiments are costing legit businesses an aggregate of millions of dollars in lost revenue.
Always been a problem with Google tho: without the work of individual webmasters they have no product whatsoever, but they have always dealt with all but the ebays/amazons of the world like we were freeloaders deserving no consideration whatsoever.
I've suspected that google has switched for a while, but caveman's observations really clicked something in my head, I've been so busy saving client's sites this summer that I hadn't had time to really step back for that overview of just what it was that lay behind all the stuff I was doing, but now it's starting to click:
Everything I did this summer to save sites that fell, and am now having to always do on every site I do, involves a significant tightening of discipline, no errors allowed. Previously we got away with massive errors. What does not support this type of error? A brand new system written to be tighter, that tightness gives greater control, but also hurts people who were able to get by with errors in the past.
All sites I do that were written from the beginning with no errors of this type, or where I fixed these types of errors a year ago, have seen zero negative effect in the serps, in fact, almost all are performing better and better. But older sites, with sloppy components, code, webmastering, dupe content, etc, are behaving RADICALLY unpredictably. Huge fluctuations, corrections, new fluctuations.
In other words, when I feed google clean, unambiguous data, it works very very well. When that data is corrupted, it becomes extremely unstable internally in the google algo. This is exactly what I would expect from a new algo, since it's exactly what I see with my own web programming with each new thing I develop.
Just like say using XHTML instead of slop html, or switching to objects instead of running all spaghetti code [say windows nt 5 versus windows 98]
If this is a rewritten algo, obviously the google engineers would have rewritten one of its weakest components, the dupe filter. But a new piece of software has to be tested, then you have to start pouring real data through it. You can only test in the labs so much.
I don't believe google had the luxury of postponing this rewrite like say Microsoft has with Vista, I think they were up against a wall and had to do the rewrite, this was becoming fairly evident a while ago as far as I'm concerned.
The relative smoothness of the transition is a testimony to just how good the engineers they've been stockpiling over the last year are. But at some point if you see enough trees it might pay to ask if you are in fact in a forest.
I'm usually happy to criticize google, but since I do enough programming to have some sense of just how difficult it is to introduce new stuff of any complexity live, I have to look at the collateral damage slightly differently, you can't come out the door perfect, but at some point you do have to make the decision to walk through it and put it live.
But, again, the dupe content issues caveman raised seem much more like beta issues than mature algo tweaks. Which suggests to me that this stuff will improve over time as they analyze the data they're getting.
A lot of other things are explained too as far as I'm concerned. My guess is that the new algo is much less sloppy than the old one, they have after all learned a lot in the last 7 years, and probably had a very long wish list that simply could not be implemented if they tried to force more and more junk into an essentially obsolete piece of software.
archive and real threads the same
posts and real threads the same
print version and real threads the same
There's so much dupe content inside vbulletin every forum running it would be penalised.
But then look at [google.com...]
2nd is a vbulletin forum.
I tend to judge new theories with my own personal smell test. The test includes questions like:
- Is the theory all new, or is it more an extension/expansion of known SE behavior? (I shy away from "all new" most of the time; most changes are evolutionary, not revolutionary.)
- Do others see similar issues/patterns, or am I conjuring up something related only to my own little world?
- Can I find a way to disprove the theory?
So far, what I'm thinking still feels right to me personally.
That said, nothing in your post above feels necessarily wrong to me. In fact, it may be that what I'm theorizing is a part of a new algo that you are brave enough to wonder aloud about. Honestly I don't know, and I'm afraid I'm not smart enough to sort that out. ;-)
The one aspect of the new algo post I disagree with is the conclusion that these are more "beta issues than mature algo tweaks." I believe that that statement is ONE potential view of the problems we're seeing. But I also believe it is quite possible that G is accepting an ever increasing level of collateral damage, and that with the tweaks of roughly 9/22, they may have just underestimated how much collateral damage there is.
The algo's these days are so complex, that I sometimes wonder how many people can fully understand or predict the impact that even a small tweak can bring.
I have an info white hat website with no ads that is still ranking well. It has no duplication issues except one. All the page titles are usually different keywords, very varied on this domain. But on one page the title includes the two words from the home page. This page isn't cached by google. The content on this page is totally different to that on the home page however. The only similar feature is two title keywords.
Logically, Google would not want to eliminate even the "original" copy of the duplicated pages.
As well, it would make sense for Google to change its approach to duplicate content -- including internally duplicated content -- because this is an area where its algorithms and SERPs have been vulnerable to manipulation.
Any site that creates numerous near-duplicate copies of the same content could potentially receive a boost in the SERPs as a result of increased internal link "votes", increased anchor text volume, etc. A few years ago, these side effects of internal duplication were mostly inadvertent and infrequent, but with increased SEO savvy and increased spamming activity, this once-minor problem might be looming large on Google's priority list.
If so, it may be experimenting with some drastic algo changes that try to prevent any double counting of the same content, while accepting a temporary risk of leaving some duplicate content out of the SERPs entirely.
Weirdly functioning Dup Filter is one but so is some kind of internal/external link value change and - possibly - some sort of "thin affiliate" filter gone mad (my suspicion is it's defining legit affiliate practices like hotel sites using an offsite booking engine as a questionable practice).
Of course, all of these can be subsumed under problems with a new algo, but it makes fixing whatever particular practice - many that have been fine for years - that is the killer mostly guesswork (and is there 1 killer or the cumulative effect of several that would merely maim independently?).