
Moderators: Robert Charlton & aakk9999 & brotherhood of lan & goodroi

Google SEO News and Discussion Forum

This 154 message thread spans 6 pages: < < 154 ( 1 2 3 [4] 5 6 > >     
Duplicate Content Observation
Some sites are losing ALL of their relevant pages

 7:05 pm on Sep 29, 2005 (gmt 0)

We've just done a whole bunch of analysis on the dup issues with G, and I wish to post an observation about just one aspect of the current problems:

The fact that even within a single site, when pages are deemed too similar, G is not throwing out the dups - they're throwing out ALL the similar pages.

The result of this miscalculation is that high quality pages from leading/authoritative sites, some that also act as hubs, are lost in the SERP's. In most cases, these pages are not actually penalized or pushed into the Supplemental index. They are simply dampened so badly that they no longer appear anywhere in the SERP's.

The current problem is actually not new IMHO. It began surfacing on or about Dec 15 or 16 of last year. At that time, the best page for the query simply seemed to take a 5-10 spot drop in the SERP's...enough to kill most traffic to the page, but at least the page was still in the SERP's. If there were previously indented listings, those were dropped way down.

From early Feb through about mid March, the situation was corrected and the best pages for specific queries were again elevated to higher rankings. When indented listings were involved however, the indented listing seemed now to be less relevant than was the case pre-Dec.

In mid March to about mid May, the situation worsened again, approximately to the problems witnessed in mid Dec., i.e., the most relevant pages dropped 5-10 spots, indents vanished as was the case in Dec.

But the most serious aspect of the problem began in mid May, when G started dropping even the best page for the query out of the visible SERP's.

A few days ago, the problem worsened, going deeper into the ranks of high quality, authoritative sites. This added fuel to what has become the longest non-update thread [webmasterworld.com] I've ever seen.

Why This is Such a Problem
The short answer is that a lot of very useful, relevant pages are now not being featured. I'm not talking about just downgraded. They're nowhere.

Now, I'm sure that there are sites that deserved the loss of these vanished pages. But there are plenty of others whose absence is simply hurting the SERP's. There is a difference between indexing the world's information, and making it available after all.

Hypothetical Example

We help a client with a scientific site about insects (not really, but the example is highly analogous). Let's discuss this hypothetical site's hypothetical section about bees. Bees are after all very useful little creatures. :-)

There are many types of bees. And then there are regional differences in those types of bees, and different kinds of bees within each type and regional variation (worker, queen, etc). Now, if you research bees, and want to search on a certain type of bee - and in particular a worker bee from the species that does its work in a certain region of the world, then you'd like to find the page on that specific bee.

Well, you used to be able to find that page, near the top of the SERP's, when searching for it.

Then in mid Dec, you could find it, but only somewhere in the lower part of the top 20 results.

Now, G is not showing any pages on bees from that site. Ergghh.

What is an Affected Site To Do?
One option, presumably, would be to stop allowing the robots to index the lesser pages that are 'causing' the SE's to drop ALL the related pages. But this is a disservice to the user, especially in an era when GG has gone on record as taking pride in delivering especially relevant results, and especially for longer tail terms.

Should we noindex all the bee subpages, so that at least searchers can find SOME page on bees from this site? (I'm assuming that noindexing or nofollowing the 'dup' pages that are not really 'dup' pages at all would nonetheless free the one remaining page on the topic to resurface; perhaps a bad assumption.)

In any case, I refuse. Talk about rigging sites simply for the purpose of ranking. That's exactly what we're NOT supposed to be doing.

G needs to sort this out. ;-)

Note: Posters, please limit comments to the specific issues outlined in this thread. There are a lot of dup issues out there right now. This is just one of them.



 11:19 am on Oct 4, 2005 (gmt 0)


One more thing you could do is to check for the correct URL within your code.

In some kind of pseudocode:

if (thisPageURL <> lowercase(thisPageURL))
    redirect301(thisPageURL, lowercase(thisPageURL));

I did something like that for my pages and it works fine. (My page URLs are created from the headline, so if the headline changes, the URL changes too. To avoid dupe content I redirect the old URL to the new one.)
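
A minimal Python sketch of the lowercase check above, just as one way to picture it. The function name `canonical_redirect` and the example URLs are assumptions for illustration, not the poster's actual setup. It lowercases only the path, and it answers with a redirect only when the requested path differs from its lowercase form, so an already-lowercase URL is never redirected to itself.

```python
from urllib.parse import urlsplit, urlunsplit


def canonical_redirect(url):
    """Return (status, location) for a requested URL.

    Hypothetical helper: 301 plus the lowercase URL when the path contains
    uppercase letters, otherwise (200, None) meaning "serve the page".
    """
    parts = urlsplit(url)
    lowered_path = parts.path.lower()
    if parts.path == lowered_path:
        # Already canonical: no redirect, so no redirect loop.
        return 200, None
    # Rebuild the URL with only the path changed; query string is kept as-is.
    target = urlunsplit(
        (parts.scheme, parts.netloc, lowered_path, parts.query, parts.fragment)
    )
    return 301, target
```

For example, `canonical_redirect("http://example.com/Cat1/Product1/")` gives a 301 to the all-lowercase URL, while the lowercase URL itself comes back as a plain 200. Whether your code can see the originally requested casing at all depends on the server platform.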


 11:25 am on Oct 4, 2005 (gmt 0)

>>>>Another way to think about this is previously Google has stated pagerank was a deciding factor in choosing a canonical page... in this case that is 100% not the case; instead, the choice of the canonical page is made strictly based on age.

I know this is getting OT - but lots of homepages were removed (apparently totally) from the index and then re-indexed. (while pages within the site remained)

So if those sites rely on the oldest page being indexed as the Canonical - they are screwed.

I am seeing the same as Steveb in a site search eg:-

Bad Supplemental Match.
Bad Supplemental Match.
Bad Supplemental Match.
Good Recent Cached Match.

and then it shows in normal serps where the top 2 can be listed and then the rest omitted with the more from this site option available.


 1:32 pm on Oct 4, 2005 (gmt 0)

Unfortunately, I can't create temporary pages as 2by4 has suggested because the ones with some uppercase letters are really the exact same pages as the ones that are correct with all lowercase letters.

Windows servers allow you to call a page with uppercase letters and will return results for those pages, but they do not allow you to have actual unique pages with and without uppercase letters. So when you call a page with uppercase letters and call a page with lowercase letters, although the URLs will be different and, in Google's eyes, unique, the page is really the exact same page on the server. So you can't do a 301 redirect, because you will also be redirecting the ones with all lowercase (albeit you'll be redirecting those to themselves). I think that will cause even more problems.

Additionally, if you 301 them, it will just move them to the supplemental index, and IMO that will not fix the problem because Google, it appears, still levies the duplicate content penalty even if one of the duplicates is in the supplemental index. Therefore you can only get the duplicate content penalty to go away if one of the duplicates is removed completely from Google's indexes (the main index and/or supplemental index, but preferably the one in the supplemental index).


 9:11 pm on Oct 4, 2005 (gmt 0)

We followed everyone's advice on the htaccess file editing, the frame breaker code...and on and on....

Guess what....worked great in MSN, we are back with a vengeance. As for Google, just more of the same old penalty, no penalty, double talk, triple talk...no clue bs, as to why our website remains depressed when searching for mydomainname.com


 11:59 pm on Oct 4, 2005 (gmt 0)

ledfish, good points, sorry, I should have remembered about the case insensitivity, but it's been so long since I've used Windows servers that these details slipped my mind. I'm used to having full control.

Again, we resolved our issues by moving the site to Apache. Obviously, for sites that have extensive programming this isn't an option, but for all you webmasters out there who can choose now, keep these types of issues in mind when you decide which platforms to work with.

That is a quandary though, definitely: how to tell a case sensitive bot [and of course filenames have to be case sensitive internally or the bot couldn't get the correct pages on all non-Windows machines, ie about 70% of the web] that a certain case needs to be redirected to another case when the server can't see the difference.

I'll add this to my long list of reasons why I won't use IIS for any reason.

Almost all our problems were direct consequences of using windows hosting on that one site.


 3:19 am on Oct 5, 2005 (gmt 0)

Well, I might have found a partial solution.

We have ISAPI Rewrite on our server and ISAPI Rewrite is case sensitive, so what I did was make a rule so that URLs with uppercase letters in them get served a page that does not exist. Thus when Google tries to visit any of those pages with the uppercase URLs, it will basically get a 404 error.

As I said, the problem was caused by us, so for 99% of them, the url with the uppercase is consistent making it easy to trap them with ISAPI Rewrite.

Now the only remaining problem is getting them removed via the Google removal tool, because there are about 700 of those dang pages. That's a lot of typing. What's more frustrating is that because Google is not case sensitive, I haven't found a way to query Google to return a list of all those pages with the uppercase letters in the URL that are in Google's index.
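
One possible workaround, sketched in Python: instead of querying Google, scan your own server logs for the mixed-case paths that Googlebot has actually requested. This is only a sketch under assumptions. It assumes Apache-style common/combined log lines (IIS W3C logs use a different field layout and would need a different parser), and the function name `uppercase_paths` and the `agent_marker` parameter are made up for illustration. It also only tells you what was requested, not what is currently in the index.

```python
import re

# Matches the request part of a common/combined log line,
# e.g. "GET /Some/Page.htm HTTP/1.1" (assumed log format).
REQUEST_RE = re.compile(r'"(?:GET|HEAD) (\S+) HTTP/[\d.]+"')


def uppercase_paths(log_lines, agent_marker="Googlebot"):
    """Return the sorted, de-duplicated paths containing uppercase letters
    that were requested by a client whose log line mentions agent_marker."""
    found = set()
    for line in log_lines:
        if agent_marker not in line:
            continue
        match = REQUEST_RE.search(line)
        if not match:
            continue
        # Ignore the query string; only the path casing matters here.
        path = match.group(1).split("?", 1)[0]
        if path != path.lower():
            found.add(path)
    return sorted(found)
```

Feeding it the lines of an access log (for instance `uppercase_paths(open("access.log"))`, with a hypothetical file name) would give a list to work through in the removal tool.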

What a nightmare, but at least knowing Google is now getting a 404 error on the pages, it's not quite as scary.


 5:18 am on Oct 5, 2005 (gmt 0)


Although that logic pretty much resolves the issue, I think it should be more about preventing the issue from happening from the get-go.

if (thisPageURL <> PROPERcase(thisPageURL))
{
    if (page was never spidered in wrong case)
    { serve friendly-404 with link to correct URL; }
    else
    { redirect301(thisPageURL, PROPERcase(thisPageURL)); }
}
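
The same decision could be sketched in Python roughly like this. The names `respond`, `spidered_paths` and `to_canonical` are hypothetical, and the set of already-spidered wrong-case URLs is assumed to come from somewhere like your own logs. Lowercasing stands in for "proper case" here; swap in whatever rule your site really uses.

```python
def respond(path, spidered_paths, to_canonical=str.lower):
    """Return (status, canonical_path) for a requested path.

    - canonical request         -> (200, None): serve the page
    - wrong case, known to bots -> (301, canonical): pass the old URL on
    - wrong case, never crawled -> (404, canonical): friendly 404 that can
      link to the canonical URL
    """
    canonical = to_canonical(path)
    if path == canonical:
        return 200, None
    if path in spidered_paths:
        return 301, canonical
    return 404, canonical
```

The idea behind splitting the two cases is that only URLs a spider already knows about get a redirect, while random mistyped variants never become redirecting URLs in the first place.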


 7:35 am on Oct 6, 2005 (gmt 0)

One of my sites has category pages (say example.com/cat1/) containing a short paragraph for each product taken from the main product page (say example.com/cat1/product1/). Although the product pages typically have another 300-400 words of unique content, all pages of the form example.com/catX/productY/ have gone URL only.

Yes, arran, this is exactly what happened to my site too. I think I can even pinpoint the date the problem began on my site: 16th July. From then on, referrals from Google seemed to slowly decrease, after a long time of slow increase (which is no surprise as I was adding content again and again).


 10:00 am on Oct 6, 2005 (gmt 0)

WHY isn't Google replying on this topic, which has concerned SO many webmasters over the last year? Especially the fact that the non-www URLs simply don't vanish from the SERPs on Google, so many sites are hit by the duplicate filter. And of course, what are we supposed to do about supplemental results?

We are not moving forward on this, and the more we try to fix it ourselves, the more we could ruin our "REAL" rankings or listings on other SEs.


 10:05 am on Oct 6, 2005 (gmt 0)


They are not responding - but let us hope they are listening and working on the problem (which I hope they see it as)

Next update - whenever it is - will see if I give up on my business and move onto something else.

Que Sera Sera


 4:05 am on Oct 7, 2005 (gmt 0)

I think Google inadvertently created this problem by trying to eliminate sites that were substantial duplicates of other sites. I have seen slews of sites that you can tell are just a different presentation of the same material, and after examining them, I have also noticed that many of them are owned or managed by the same person or organization.

I also think that because we spent so much time complaining about scraper sites, Google decided that when it came to duplicate content, it was not going to choose sides except when it involved a DMCA violation, and thus decided that in the case of duplicates, it would just deep-six all sources. The reason for this is that they probably don't have a consistent and fair way of determining which site is the true and legal creator of the content in the first place unless, like I said before, it involves a DMCA complaint. Using the oldest existing domain method was not fair, because the age of a site is not a sure way of determining whether something was original or stolen. I was the victim of some scraping, and because the scraper's site was older than mine, I got penalized. I was only able to regain my standing in the rankings after a successful DMCA complaint filed with Google and a legal war with the infringer.

Anyhow, because of all this, I think the duplicate content knob got turned up so high that it is harshly affecting duplicates within sites. Often duplicates within a site are more likely to be unintentional rather than intentional. My current situation with the two URLs is a perfect example.

But I agree with what someone else said: I believe Google is trying to address it, at least from a within-the-same-domain standpoint, and probably will in a future update. At the same time, the duplicate content filter being so high has to be hurting those who rely on stealing others' content and making a living off of it. So the longer they see their ill-gotten profits sinking, the quicker they will be trying to figure out how to reverse that trend, and the only thing to do is get away from the duplicate content. It's unfortunate though that at the same time the person who is having their original content stolen also has to pay the price, and I don't think anybody would disagree that it's just not fair.


 7:54 am on Oct 7, 2005 (gmt 0)

... the sad thing is ... <mean guess> that the longer / the more this duplicate-content issue "blocks" legitimate websites, the more those sites will spend on adwords to get the same traffic </mean guess> - meaning Google will profit financially from sites not being indexed properly (even though it might harm them in the long term because of other SEs being given more traffic)...


 8:46 am on Oct 7, 2005 (gmt 0)

The easiest way to do this surely would be for Google to separate commercial listings and make people pay for a listing. I can't imagine that businesses would not be willing to pay to ensure a listing - they do for everything else. I would happily pay for my site to be in Google if it meant that this kind of thing stopped happening. Then they could also specify that a particular site not be allowed a listing because it does not meet quality guidelines.
If there is a dupe content filter, it is really going to hurt online businesses, because so many are selling the same products or will have similar problems listing their items for sale in a way that doesn't trip a dupe filter.
If G is telling us to build sites for customers not SE's, I wish they meant it.
A paid listing would mean you could build your site the way you want in a sensible, commercial way, not constantly on the edge of a nervous breakdown.
Yes I know Adwords is designed for that, but we get few conversions from that (guessing that people only tend to click on page listings not Adwords).


 11:30 am on Oct 7, 2005 (gmt 0)

Someplace I saw it mentioned that returning a 404 may not be the best response, since the spider will keep trying to fetch that URL, assuming it may be a temporary problem. I think the suggested response was a 402 or a 405. I can't seem to find the message.


 11:41 am on Oct 7, 2005 (gmt 0)

The suggestion was to use a 410:-
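
For reference, 410 is the HTTP "Gone" status, meaning the resource has been removed on purpose and is not expected to come back, whereas 404 only says "not found". A small Python sketch of the distinction, with hypothetical names (`status_for`, `live_paths`, `retired_paths`); how any particular search engine treats the two codes is not something this sketch can tell you.

```python
from http import HTTPStatus


def status_for(path, live_paths, retired_paths):
    """Pick a status code for a request path.

    live_paths:    paths that should be served normally
    retired_paths: paths removed on purpose (e.g. wrong-case duplicates)
    """
    if path in live_paths:
        return HTTPStatus.OK          # 200
    if path in retired_paths:
        return HTTPStatus.GONE        # 410: deliberately and permanently removed
    return HTTPStatus.NOT_FOUND       # 404: unknown URL
```

So `status_for("/Widgets/Red.htm", {"/widgets/red.htm"}, {"/Widgets/Red.htm"})` would answer 410 for the retired mixed-case URL and 200 for the lowercase one.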



 1:05 pm on Oct 7, 2005 (gmt 0)

It cannot be that we have to do all this stuff just to please Google. Over time this will ruin your site and its rankings on other SEs. Create your site as you like and make it easy to use for your users; don't make it for Google, because that's the only search engine that has all these troubles these days. Also remember that Google itself said to make the site for the users. But OK, they also said it was impossible to ruin a site from outside.


 1:22 pm on Oct 7, 2005 (gmt 0)

I appear to have ruined mine from the inside by trying to make it more user friendly, i.e. by linking relevant products and having products showing in more than one section :(


 1:42 pm on Oct 7, 2005 (gmt 0)

Not sure how I would create a 410 on a windows platform or with ISAPI Rewrite which would be more important because I'm currently using ISAPI to redirect to a 404 state. ISAPI doesn't appear to have a [G] switch. If it does, it's not in the documentation.

Anyhow, with a 404 I can at least use the removal tool to get those pages out of the index, and that is better than any other option I have right now. The only problem is that I have to remove the pages one at a time and there are about 800 of them.

BTW, Google still hasn't gotten back to me either on how they would suggest resolving the problem. That is the one thing that really frustrates me: getting any kind of response from Google is like trying to pull teeth. It's not like I'm asking for the recipe to the algo or anything. I presented a very real technical problem that had no intent to game their search engine or anything. I would have thought that they would have completely ignored one of the two URLs, or treated them as the same, since the only difference was the presence of some uppercase letters in one, but I guess I'm wrong.

It wasn't my intention to steal this thread, although I have appreciated all the feedback from people with suggestions on resolving my unintentional duplicate content issue.


 3:02 pm on Oct 7, 2005 (gmt 0)

If there is a dupe content filter, it is really going to hurt online businesses, because so many are selling the same products or will have similar problems listing their items for sale in a way that doesn't trip a dupe filter.

Yes, but look at it from a user's point of view: What's the value in searching for "red widgets" and finding 80 boilerplate catalog pages for the same product in the first 10 pages of search results? Wouldn't it make more sense for those duplicate pages to show up in Froogle (rather than in the mainstream SERPs), which organizes the results by price to make duplicate listings helpful to the user?


 3:25 pm on Oct 7, 2005 (gmt 0)

Froogle is a good idea, though too few businesses seem to be using it at the moment.
If I want to buy a x brand washing machine, then I go to one of the shopping comparison sites, but I also use G to get smaller companies who can't afford to get listed on a comparison site, because they are often cheaper with a better service. I'd guess also that Froogle does not cater for businesses that supply services rather than goods.
This is always going to be a problem with commercial and non-commercial listings being listed chaotically as in a SE index.

I have no objection to any form of ordered shopping index or whatever, except this - customers hardly seem to use them.


 3:56 pm on Oct 7, 2005 (gmt 0)

There is a very inaccurate assumption in some of the posts of this thread. It is the assumption that most/all of the pages hit by this internal site filter, if it exists, are commercial.

While this problem clearly affects commercial sites -- where similar products, geos, and just plain spam pages are involved -- it also affects a disturbing number of niche authority and hub sites. These are sites with detailed and often excellent coverage of their topic. Unfortunately it is the breadth and depth of the coverage that now seems to be causing problems for these sites.


 4:05 pm on Oct 7, 2005 (gmt 0)

Sorry...I was not assuming that it only affects commercial sites. I am talking from the perspective of a commercial site owner.


 4:05 pm on Oct 7, 2005 (gmt 0)

caveman is right, it's all sites that are hit by this, whatever it is. I'm also not sure about an internal filter; it would be total nonsense to have one. I think it's just a normal filter.


 4:06 pm on Oct 7, 2005 (gmt 0)

"At the same time, the duplicate content filter being so high has to be hurting those who rely on stealing others content and making a living off of it...It's unfortunate though that at the same time the person who is having their original content stolen also has to pay the price and I don't think anybody wouldn't agree that it's just not fair. "

So far, ledfish, far as I can see the original content producer is the only one paying the price.

In my case if I do allinurl: widgets.com the entire first page is scrapers stealing my content so how exactly are they suffering?

Right now the filter is filtering me, not them.


 4:58 pm on Oct 7, 2005 (gmt 0)

zeus, FYI, internal site filtering and/or the motivations behind it have been around for a long time now. It has long been more a question of what/how, than "if".

Back at some conference in '02 I think, a Google engineer was noted as saying that exact duplication will not cause sites many problems. (Google will simply try not to show exact dup pages in any given SERP.) Reason: There is a ton of exact duplication around the Web - think of works by great poets or writers for example, or political/historical documents.

The issue even several years ago was not exact duplication; it was "near duplication" ... too much similarity between pages.

When a substantial number of pages look very similar, it gives the impression that they are either not all worthy of being shown, or have been modestly changed to avoid filters. Again, that was a G engineer talking at a conference, two or three years ago.

My only question now is whether internal site filtering is suddenly taking out ALL similar pages from a given site, rather than just some/most.
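
To make "near duplication" a bit more concrete, here is a Python sketch of one generic textbook way to score how similar two pages are: break each text into overlapping word triples ("shingles") and compare the two sets with Jaccard similarity. Nothing in this thread says this is what Google does; the function names, the shingle size of 3 and the sample phrases are all assumptions for illustration.

```python
def shingles(text, size=3):
    """Return the set of overlapping word tuples of length `size`."""
    words = text.lower().split()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(text_a, text_b, size=3):
    """Similarity in [0, 1]: shared shingles divided by all distinct shingles."""
    set_a, set_b = shingles(text_a, size), shingles(text_b, size)
    if not set_a and not set_b:
        # Both texts too short to produce a shingle: treat as identical.
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)
```

Two pages whose text differs in only a word or two inside a long shared template would score close to 1.0 with a measure like this, which is the kind of "too similar" the engineer's comment is describing. A score of 1.0 means identical shingle sets and 0.0 means nothing in common.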


 6:10 pm on Oct 7, 2005 (gmt 0)

I used to rank with &filter=0, but now my site is completely gone. Does anybody have the same situation? What does it mean?


 8:48 pm on Oct 7, 2005 (gmt 0)

Do you all imagine amplifying the differences would make a change, eg. "red widgets <b>with</b> keychains", "red widgets <b>without</b> keychains"?


 12:53 am on Oct 13, 2005 (gmt 0)

I have noticed this problem with one of my sites recently. It's a three year old site but a lot of the pages are specifications for let's say televisions. Unfortunately in specifications sheets for televisions each spec sheet has to mention things like screen size, colours displayed, weight, dimensions and so forth.

The pages ranked very well until the last couple of months when suddenly all specifications pages disappeared or only show up as a url without title or description. That's annoying because it takes me a good three hours minimum to write a decent specification sheet and check it's technically correct.

If they were indeed pulled because of page similarity then what are people supposed to do? Find a new word for 'weight' or 'dimensions' or 'screen' every time a new specification has to be written? Yet dozens of other sites that share a copied and pasted generic specifications sheet from the manufacturer or wherever seem to do quite well with that.

It's quite depressing really, but at the end of the day I enjoy writing them and will do so regardless of what Google thinks about them.


 1:08 am on Oct 13, 2005 (gmt 0)

shraz, welcome to WebmasterWorld.

We're all hoping G will get it sorted in short order; lots of useful pages have gone by the wayside.

G's motivation is understood. Their implementation needs some tuning.


 1:42 am on Oct 13, 2005 (gmt 0)

>> Find a new word for 'weight' or 'dimensions' or 'screen' every time a new specification has to be written?
>> Yet dozens of other sites that share a copied and pasted generic specifications sheet from the manufacturer or wherever seem to do quite well with that.

Because they either have substantial link equity, or the net result is that their page has enough algorithmically unique content on it to bypass the filters.

Use Copyscape (for starters) to determine what parts are being trapped as dupes, and try to work out which patterns in these specs can be altered to solve the problem. That is most likely what the larger review and product related sites have done.


 3:16 am on Oct 13, 2005 (gmt 0)

As many have said, you've taken down many good content rich, spam free, code clean, and spider friendly pages. I ask, for what? We follow your TOS, I know I have code that validates, and now you ask ME to provide you with a sitemap. Why? All the other search engines recognize my site and sites like mine and rank them accordingly, and rather more quickly. I have one page, my index page, which still ranks well, but all my other content is gone and buried so deep there's no bother looking. Do I have duplicate content? NO. Do I use a navigation structure in template form to make surfing easier for my users? Yes. Does anyone else feel a dup content penalty for template pages may have been applied (although all meta tags etc. have been changed, along with H tags as well)?

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved