This 154 message thread spans 6 pages.
Duplicate Content Observation
Some sites are losing ALL of their relevant pages
We've just done a whole bunch of analysis on the dup issues with G, and I wish to post an observation about just one aspect of the current problems:
The fact that even within a single site, when pages are deemed too similar, G is not throwing out the dups - they're throwing out ALL the similar pages.
The result of this miscalculation is that high quality pages from leading/authoritative sites, some that also act as hubs, are lost in the SERP's. In most cases, these pages are not actually penalized or pushed into the Supplemental index. They are simply dampened so badly that they no longer appear anywhere in the SERP's.
The current problem is actually not new IMHO. It began surfacing on or about Dec 15 or 16 of last year. At that time, the best page for the query simply seemed to take a 5-10 spot drop in the SERP's...enough to kill most traffic to the page, but at least the page was still in the SERP's. If there were previously indented listings, those were dropped way down.
From early Feb through about mid March, the situation was corrected and the best pages for specific queries were again elevated to higher rankings. When indented listings were involved however, the indented listing seemed now to be less relevant than was the case pre-Dec.
In mid March to about mid May, the situation worsened again, approximately to the problems witnessed in mid Dec., i.e., the most relevant pages dropped 5-10 spots, indents vanished as was the case in Dec.
But the most serious aspect of the problem began in mid May, when G started dropping even the best page for the query out of the visible SERP's.
A few days ago, the problem worsened, going deeper into the ranks of high quality, authoritative sites. This added fuel to what has become the longest non-update thread [webmasterworld.com] I've ever seen.
Why This is Such a Problem
The short answer is, that a lot of very useful, relevant pages, are now not being featured. I'm not talking about just downgraded. They're nowhere.
Now, I'm sure that there are sites that deserved the loss of these vanished pages. But there are plenty of others whose absence is simply hurting the SERP's. There is a difference between indexing the world's information and making it available, after all.
Hypothetical Example
We help a client with a scientific site about insects (not really, but the example is highly analogous). Let's discuss this hypothetical site's hypothetical section about bees. Bees are after all very useful little creatures. :-)
There are many types of bees. And then there are regional differences in those types of bees, and different kinds of bees within each type and regional variation (worker, queen, etc). Now, if you research bees, and want to search on a certain type of bee - and in particular a worker bee from the species that does its work in a certain region of the world - then you'd like to find the page on that specific bee.
Well, you used to be able to find that page, near the top of the SERP's, when searching for it.
Then in mid Dec, you could find it, but only somewhere in the lower part of the top 20 results.
Now, G is not showing any pages on bees from that site. Ergghh.
What is an Affected Site To Do?
One option, presumably, would be to stop allowing the robots to index the lesser pages that are 'causing' the SE's to drop ALL the related pages. But this is a disservice to the user, especially in an era when GG has gone on record as taking pride in delivering especially relevant results, and especially for longer tail terms.
Should we noindex all the bee subpages, so that at least searchers can find SOME page on bees from this site? (I'm assuming that noindexing or nofollowing the 'dup' pages that are not really 'dup' pages at all would nonetheless free the one remaining page on the topic to resurface; perhaps a bad assumption.)
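For what it's worth, if a site owner did go down the noindex route for the subpages (not that I'm recommending it), the mechanics are just a meta tag in the head of each 'dup' page - a minimal sketch, with a hypothetical page in mind:

```html
<!-- On each bee subpage (hypothetical URLs like /bees/worker-bee-region-x.html): -->
<!-- tell all robots not to index this page, but still follow its links -->
<meta name="robots" content="noindex,follow">
```

Whether that would actually free the remaining main page to resurface is exactly the assumption I'm not sure about.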
In any case, I refuse. Talk about rigging sites simply for the purpose of ranking. That's exactly what we're NOT supposed to be doing.
G needs to sort this out. ;-)
Note: Posters, please limit comments to the specific issues outlined in this thread. There are a lot of dup issues out there right now. This is just one of them.
Indeed, this seems to be what has happened to my site. The nature of our business means that many products do read similarly to a spider, when in real life they are totally different and cannot be included all as one product. I now have hundreds of pages in the Index which have no descriptions, I've lost a notch of PR and our site is filtered for the major KW's applicable, presumably some kind of penalty.
Also, how do you disallow G from product pages which are created dynamically from templates? Apart from disallowing for all product pages?
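One hedged possibility, assuming your product URLs share a common pattern (the paths below are made up): Googlebot honors the * wildcard as an extension to the robots.txt standard, so you can block template-generated pages by pattern rather than listing each one:

```
# robots.txt - hypothetical URL patterns, adjust to your own cart's structure
User-agent: Googlebot
# block dynamically generated product pages of the form /products/item.php?id=...
Disallow: /products/item.php
# or, more bluntly, block any URL containing a query string
Disallow: /*?
```

Note the wildcard lines are Googlebot-specific; other robots may ignore them.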
Tightening of the duplicate content filter/penalty between pages on the same site is exactly what i've seen.
One of my sites has category pages (say example.com/cat1/) containing a short paragraph for each product, taken from the main product page (say example.com/cat1/product1/). Although the product pages typically have another 300-400 words of unique content, all pages of the form example.com/catX/productY/ have gone URL only.
Another explanation would be a template penalty but i think that's highly unlikely.
So what percentage of dupe content would be acceptable?
I've just used a checker to compare pages, and my home page is still 43% duplicate to a product page and 65% similar to a section page (after I've been tweaking for the past few days to see how low I can get it).
I know it's only going to be a guess, but I'd guess that Google has a lower threshold than Yahoo or MSN as my site has not been affected on there.
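For anyone curious what those checkers are doing under the hood: most of them are some variant of "shingle" comparison - break each page's text into overlapping word n-grams and measure the overlap between the two sets. Nobody outside Google knows what they actually use or where any threshold sits, but here's a rough sketch of the idea in Python:

```python
def shingles(text, n=3):
    """Break text into overlapping n-word 'shingles' (word n-grams)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def similarity(a, b, n=3):
    """Jaccard similarity of two texts' shingle sets: 0.0 (disjoint) to 1.0 (identical)."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / float(len(sa | sb))

# e.g. compare the visible text of a product page against the home page;
# a score like 0.43 would line up with the "43% duplicate" reading above
```

Even a one-word change in a sentence breaks the n shingles that span it, which is why near-duplicates with tiny edits still score well below 1.0.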
I have hundreds of pages on my site, with the smallest PR value being 3, and I am not seeing this issue with my site.
My pages used to be indented and still are for some of the section pages, which are different due to categorisation of products on them, it is the product pages which are affected.
I was PR4 for three years - it just dropped and I have no idea why apart from the possibility that G has selected some virtually irrelevant backlinks for my site (though I have many good ones - they are not showing) and of course now there are a myriad of pages with no meta tags showing.
I don't do any black hat, or link to dodgy sites.
Do you mean that Google is not checking for dup phrases of, say, three or four sentences, but instead comparing pages at the word level (the number of occurrences of "worker bees", "queens", "honey", etc. per page)?
I can confirm that. We have print and mail versions of each article on our page. No problem till last year. Then Allegra hit us and we took measures to solve the problem.
Due to a problem with our robots.txt these versions came back into the serps causing a site wide ban.
I wonder why Google's algo is not intelligent enough to sort out duplicate content *inside* one page. A print and a mail version is service to the users. They like it. (I've already fixed my robots.txt and hoping to return to the index.)
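For anyone in the same boat, the fix is straightforward once the robots.txt is right - assuming the print and mail versions live under their own paths (the directories below are examples only):

```
# robots.txt - keep the alternate versions available to users,
# but out of the index; paths are hypothetical
User-agent: *
Disallow: /print/
Disallow: /mail/
```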
It's ok if Google does not like dupes that exist on different domains. But within a domain Google's attitude goes much too far.
There is nothing funny about being paranoid and sitting there staring at logs and serps waiting for Google's next move. The next thing they'll probably do is punish sites for the use of duplicate words or letters.
Google is going too far with that and should fix this. I want to care for my readers first and not for Google's ideas.
>>>The current problem is actually not new IMHO. It began surfacing on or about Dec 15 or 16 of last year.
Yes, this is very much when Google went wrong IMO - never really recovered from then.
Caveman - is this happening to sites which definitely don't have canonical URL problems?
Tinus, I mean that if they find four pages on the same site about a certain kind of bee, and the four pages are similarly structured, and one is a main page for that bee, and the other three are subpages about the same bee, each reflecting a variation of that bee, the site owner now seems to run the risk that they will find all of the pages too similar, and filter them all, not just the three subpages.
There are so many possible variations to what might specifically lead to all of the pages being filtered that generalizing becomes difficult past a point. Suffice to say that the more similar the related pages are, the greater the risk that all of them will be filtered out. At least that is how it seems to us, after looking at a substantial number of examples across sites in different categories, some owned by us, others not.
I can see filtering out the three subpages (IF it's really helping G in their battle against spam), since landing on the main page for a set of related pages will get you close enough as a user. But filtering all the pages just hurts the user, since the resource (i.e. the site) is now dropping out of the SERP's completely for some longer tail terms.
[edited by: caveman at 9:13 pm (utc) on Oct. 2, 2005]
I had actually considered deleting my internal pages for two reasons: one, because Google now gives webmasters who have stolen my content the credit, AND Google is doing as caveman describes and dropping those pages for being too internally similar to other pages on the site.
My site did much better when it was an 8 page website. As soon as it became 30+ pages of content the site was PR0, and all pages dropped.
caveman, you have outlined the situation very well. I have experienced the same thing and agree with your assertions.
And if this is the case, I cannot think of any good solutions for getting around this filter.
One ridiculous solution I thought of is that you could put a hundred page website all on your index page and then use targets to go to each "page". But of course, this is not very feasible or desirable.
However, I still do see a few sites ranking that have many near-duplicate pages. I'm not sure how they manage to avoid the filter, but there seems to be a type of "get out of jail free card" that lets certain sites avoid the filter, IMHO.
caveman, I have one site with the same issues. But its now gotten to the point where Google has dropped 50% from the index altogether and 25% are now supplemental results.
[However, I still do see a few sites ranking that have many near-duplicate pages. I'm not sure how they manage to avoid the filter, but there seems to be a type of "get out of jail free card" that lets certain sites avoid the filter, IMHO.]
When the changes occur which displace sites, not all sites seem to be hit at the same time. I remember Florida when people were saying that their site was fine, and a week or two later saying the same thing had happened to them. Being non-tech, I don't really understand why some sites should be affected before others...
I'm getting really mad about this now - I'm just going through the 'dupe' products to see if I can change the wording to make them more different. We sell some neck widget and matching wrist widget so the items are almost identical and our shopping cart has a facility for linking the products together for the customer's benefit. Just the act of linking the products makes the pages so similar G can't tell them apart. I'm having to unlink all the linked products - we are supposed to design sites with users in mind, and then this is what happens!
It's times like this that you really wish you could pick up the phone and speak to G to tell them just how illogical this all is.
The only alternative for me is to buy another expensive shopping cart for £800, which allows you to list products with options all on one page instead of having individual product pages. These sites don't seem to have suffered the same way.
Yes, yes this is our experience too.
We are unlucky enough to have one of our sites concern widgets. These widgets are different in different geographies, and by age, and by model, and by price etc. They are in fact quite different widgets. But not to Google.
We have been writing unnatural nonsense content for some months now in a very time consuming attempt to have them show up in the serps.
Then there is also the further problem of other sites scraping our deep internals and rendering them dup. Our only choice is to rewrite these too.
The other issue is that when we rewrite a dup page Google comes along in a week or so and re-lists it (which is good I suppose) but then it may drop the page as dup again a month later because the dup algorithm has changed.
There really is no stability any more…
Nice outline of the problem. Please allow me to be the devil's advocate here, just for argument's sake:
1st: Did you structure the client's site from the beginning to create a lot of subpages, or does the kind of information FORCE you to structure it like that?
Background idea of that provocative question:
If you assume that Google is very smart and not broken, their filter may be a very good one by concept. If you have some "bees" which have (e.g.) 200 words of information with only 15 words (color, function, region) differing, your 3 subpages per bee are highly redundant and your site might not be structured well.
Meaning: your previous top position in Google was undeserved, because you have highly redundant pages around 1 topic which could be on 1 page, not 4. The 3 pages around that one topic (long tail, as you said) gave 3 backlinks to the main page, I presume. That boosted the main page up - artificially, though.
From what you describe, I am even more sure that the dupe filter especially applies to interlinked dupe pages, fighting the effect of people trying to spin off more pages from one content source, creating "on-site linkfarms" (can I trademark that, pls?).
|In any case, I refuse. Talk about rigging sites simply for the purpose of ranking. That's exactly what we're NOT supposed to be doing. |
Still wearing the devil's advocate hat: Google does not care why you created 4 very similar pages around 1 bee. You look like an optimized site (and honestly: I guess you are) which works with artificial internal linking on duplicate content. If that structure is template driven, I would tweak it!
IMHO Google has every right to do with their product whatever they want and kick out whoever they want. If you want to rank well, adjust to that. I guess you have found out all the "how tos" to avoid the dupe content penalty. Seen like that: you are "rigging" the site for the user, if they can find the info better afterwards, and NEVER for Google.
I want to rank high in Google, so my potential users can find me: I am not optimizing for Google, I optimize for my users, because they use google :-)
Regards from a vodka lemon drinking person next to me ;-)
caveman, your observations are interesting, but the assumption that this is faulty behaviour may itself be questionable.
Perhaps the problem with "duplicate content" may only be a "bug" from a webmaster's point of view. From Google's point of view it may well be very purposeful.
After all, content on the same site which is too similar means that that site is highly unlikely to be authoritative in a search engine's eyes.
An authoritative site might be expected to have content that was authoritative for each subject.
... and indeed it may put it into the neighbourhood profile of the dynamic sites which produce pages by dropping keywords into standard text templates.
([Find "keyword1" "keyword2"s in "keyword3". "Keyword2"-Source is just number 1 resource for "keyword2"s in "keyword3". Etc. Etc.])
And these days, is that a profile anyone wants to exhibit?
Edit: pontifex got there quicker than me!
|... and indeed it may put it into the neighbourhood profile of the dynamic sites which produce pages by dropping keywords into standard text templates. |
yeah, we are bashing the same sandbag here. But that is a logical approach for google, if they really want to get rid of the dupes.
I wonder if they applied that filter and said:
"well, some OK sites will vanish, too, but what will be left is really high quality"
That puts thousands of good sites into "danger" (are you working with RSS, too?), but gives Google an overall better index:
"You can't make an omelette without breaking eggs..."
Could be the name for this "non-update"-algo tweak.
>"well, some OK sites will vanish, too, but what will be left is really high quality"
I don't think they totally care just about what is left, but also how people react to it.
Since AdSense will let anyone and their dog make a buck off search and tons of hollow merchant & affiliate sites are gaming the system Google may want to add an extra opportunity cost in creating keyword driftnet type sites.
Still, leaving one page in the SERPS was not much of an opportunity cost. Having none in there, well, that hurts a bit more.
They may be trying to encourage producing actual information instead of raw automated page generation. If they go a bit too far with that sort of algorithm they probably do not care so long as the results are somewhat decent and they send their message along.
Most of the pages on the fringe which get sucked in by such algorithms would likely be affiliate content or hollow merchant databases.
Also what happens if the serps are biased toward information on commercial searches? People click the ads.
|you created 4 very simliar pages around 1 bee. |
I think the point is that they are different Bees, in several different ways. It’s just that Google isn't recognizing this...
There are many things in the world which may seem similar to a dup algo (with problems) but are indeed quite different to humans (the users).
If you have a site that concerns these types of things you will be having troubles with Google...
Where is googleguy? Or googleboy? Or googledude? Why don't they weigh in?
I have completely given up on getting into Google and concentrate my time on the other SE's. Believe me, there is life after Google.
[ Where is googleguy? Or googleboy? Or googledude? Why don't they weigh in? I have completely given up on getting into Google and concentrate my time on the other SE's. Believe me, there is life after Google. ]
I would like to see Google comment on this.
As for life after G, my traffic stubbornly stays at around 85% G referrals, however hard I try to get business elsewhere. Even PPC is rubbish.
We don't see so much of him lately, because there has never been so much serious writing about stuff that hurts your site from the outside, and we have NEVER seen so much weird stuff in the index as in this year: weird %20 domains, omitted results after 3 pages, supplemental results, 302 linking, 301 problems... I could go on. This is serious stuff, not "how do I get indexed" kind of stuff, and they are now listed.
Why would you filter internal pages from Google if it's all under 1 domain name? You cannot write different text for every product. And what about images? The picture says it all, not the text, so there is a limit to the text there too. If I'm honest, I'm looking forward to Microsoft Vista with desktop search.
Out of curiosity, did you have many deep links going to the pages that have disappeared?
I would be interested to know if deep links from outside sites helps protect from this issue, or if the deep links are devalued along with the "duplicate" content.
I would concur with the OP's original observations. I have a regional directory of widget suppliers that experienced the same phenomenon. It does have a number of deep links to its various cats (US States). After the original drop in SERPs for these pages, I went back and added an approximately 100-word generic description of the role of widgets in that State's economy, and 3-4 weeks later all the SERPs were back where they were, or even improved, and the directory now ranks in the top 5 or better across most of the relevant regional search terms in its area.
The directory in question was generated by a home brewed script that simply filled a template with the listings. All I had to do was add a field to the DB and make one quick change to the template to accomplish the above.
Seeing the whole dup issue from SEO point-of-view, I think it's one of the reasons why off-page G ranking factors seem to matter so much now (due to so much of the same content (on-page factors) floating out there).
However, as Caveman observes, there is also a wide dampening filter based on dup content determination, which seems to be a wrong way to deal with this issue as it also has had a negative effect on the true authority sites that are usually the originators of the content.
I hope G would go back in time to the early days when they seemed to manually give extra link PR power to those sites that they deemed quality ones, such as the Yahoo! directory, which had a ripple effect, providing true authority sites the ranking boost they deserved.
I guess the TrustRank system is a step to that direction in many ways.
What's slightly odd is the fall in SERPs but with no supplemental/URL-only results from the sites; if it wasn't the OP, I'd have suggested looking for another cause.
FWIW, I've just looked at a travel site, typed in a phrase in quotes and got back six results from the site, five supplemental, one normal.
So now if we webmasters post more than one page on a topic like 'bees', Google thinks we are spamming. That's not right. I think we webmasters have given too much power to Google, because it's the webmasters that have the power to direct their viewers to other SEs. Maybe in the future one SE will not have so much effect on so many webmasters and web sites.
Your points about the need for G to fight spam are fully understood and appreciated. In fact, that is why I noted that I could understand if G were filtering the three subpages that expand upon the differences in the species, as long as they left the main page on the species intact.
For a little more background, most of the sites we work on we do for ourselves, but I also do a little client work, and the site that first got my attention WRT this issue was a client site. It's as clean as clean gets, primarily a scientific site, and reasonably well known among those who care about the topic. If you are a scientist, you care about even minor differences within a genus/species, and for reasons not worth getting into here, the site was developed with different pages for each variation. It was not done for SEO reasons; it was done long before the site operators had any sense of SEO (which today for them is at best a necessary evil).
There was a time when G would show all four "bee" pages from this site (i.e., the main bee page and the three subpages). Late last year that began to change, and it became obvious over the last nine months that G was trying to sort out which "similar" pages to show, and which to dampen. I outlined above the general evolution of what they were doing, at least as far as I was able to work out.
One could look at this site and argue the merits of the need for the three subpages. I believe they are warranted. Any scientist would concur. Some laypeople might not. In any case, the pages are not spam in the eyes of the site creators. The one main page and three subpages exist for a reason: to help the users of the site.
The problem, most probably, is that as pontifex and stever imply, there are certainly parallel sorts of examples out there where spammier sites are doing essentially the same thing, to get more pages and capture more long tail searches. This phenomenon probably has its roots in the Florida Update, after which SEO'ers saw the benefits of having larger sites and more deep/specific pages...but took things to extremes in some cases.
G and the other SE's have an ever more difficult problem when it comes to controlling spam. I am sympathetic.
What I am not sympathetic to is overly harsh filtering of pages that are entirely legitimate. There was a time when G was finding a way to show the main "bees" page from the site I've alluded to, while substantially dampening the related subpages. This was not ideal for the site in question, because it caused a minor drop in traffic and put some searchers one page away from the ideal landing page. But I understood the issue G was dealing with (I think), and again, I was sympathetic.
The issue now is that, as has been true on other fronts as well, G has gone too far, taking out an increasing number of legitimate sites and pages in their effort to stem the rising tide of spam. From interaction I've had with the SE's, I'm not entirely sure that they always understand this. There is a tendency among the engineers to at times get so caught up in the fight against spam that they become almost cynical when told that too many innocents are being taken out as well.
So IMHO, the SE's need to hear it from us when too many excellent sites are being hurt. I do not believe that is G's intent.
From a purely logistical standpoint, I can't see Google penalizing for duplicate content on the same site, or at least not very severely. They know very well, for instance, that many CMS's will have up to three URL's for the same article, depending on the sequence of clicks used to reach the article. For another example, I'm sure many sites have a "bee of the week" article on one URL, which will then be archived to a different URL next week when a new lucky bee is featured. Both of these examples are real-world types of scenarios, that could easily be "perpetrated" innocently by well-meaning webmasters. Google knows this, and I just don't see them penalizing for it.
Of course, a site with ten identical pages, or sites that are identical to other (older) sites, are pretty plainly spam, should be penalized, and are. But when it comes to Joe Beekeeper running his "spare time" bee site, I don't see how Google could consider it worthwhile to penalize his site for something Joe probably doesn't know anything about.
On an experiential note, I've got several pages on one of my sites where the CMS comes up with two URL's for each page. I've never seen a penalty for this, except that the page with the longer URL normally only shows up in the SERPS as a title and a URL. The shorter URL gets a description, too.
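If anyone wants to tidy up that kind of CMS duplication themselves rather than wait on Google to sort it out, one common approach on Apache is a 301 redirect from the long URL form to the short one - sketched here with a completely made-up URL pattern; you'd have to adapt the regex to your own CMS:

```apache
# .htaccess - permanently redirect the long duplicate URL to the short one
RewriteEngine On
# e.g. /articles/archive/123/some-title -> /articles/123 (hypothetical pattern)
RewriteRule ^articles/archive/([0-9]+)/.*$ /articles/$1 [R=301,L]
```

That way only one URL per article ever answers with a 200, and the spider follows the 301 to it.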