| This 154 message thread spans 6 pages: < < 154 ( 1 2 3 4  6 ) > > || |
|Duplicate Content Observation|
Some sites are losing ALL of their relevant pages
We've just done a whole bunch of analysis on the dup issues with G, and I wish to post an observation about just one aspect of the current problems:
The fact that even within a single site, when pages are deemed too similar, G is not throwing out the dups - they're throwing out ALL the similar pages.
The result of this miscalculation is that high quality pages from leading/authoritative sites, some that also act as hubs, are lost in the SERP's. In most cases, these pages are not actually penalized or pushed into the Supplemental index. They are simply dampened so badly that they no longer appear anywhere in the SERP's.
The current problem is actually not new IMHO. It began surfacing on or about Dec 15 or 16 of last year. At that time, the best page for the query simply seemed to take a 5-10 spot drop in the SERP's...enough to kill most traffic to the page, but at least the page was still in the SERP's. If there were previously indented listings, those were dropped way down.
From early Feb through about mid March, the situation was corrected and the best pages for specific queries were again elevated to higher rankings. When indented listings were involved however, the indented listing seemed now to be less relevant than was the case pre-Dec.
In mid March to about mid May, the situation worsened again, approximately to the problems witnessed in mid Dec., i.e., the most relevant pages dropped 5-10 spots, indents vanished as was the case in Dec.
But the most serious aspect of the problem began in mid May, when G started dropping even the best page for the query out of the visible SERP's.
A few days ago, the problem worsened, going deeper into the ranks of high quality, authoritative sites. This added fuel to what has become the longest non-update thread [webmasterworld.com] I've ever seen.
Why This is Such a Problem
The short answer is, that a lot of very useful, relevant pages, are now not being featured. I'm not talking about just downgraded. They're nowhere.
Now, I'm sure that there are sites that deserved the loss of these vanished pages. But there are plenty of others whose absence is simply hurting the SERP's. There is a difference, after all, between indexing the world's information and making it available.
|Hypothetical Example |
We help a client with a scientific site about insects (not really, but the example is highly analogous). Let's discuss this hypothetical site's hypothetical section about bees. Bees are after all very useful little creatures. :-)
There are many types of bees. And then there are regional differences in those types of bees, and different kinds of bees within each type and regional variation (worker, queen, etc). Now, if you research bees, and want to search on a certain type of bee - and in particular a worker bee from the species that does its work in a certain region of the world - then you'd like to find the page on that specific bee.
Well, you used to be able to find that page, near the top of the SERP's, when searching for it.
Then in mid Dec, you could find it, but only somewhere in the lower part of the top 20 results.
Now, G is not showing any pages on bees from that site. Ergghh.
What is an Affected Site To Do?
One option, presumably, would be to stop allowing the robots to index the lesser pages that are 'causing' the SE's to drop ALL the related pages. But this is a disservice to the user, especially in an era when GG has gone on record as taking pride in delivering especially relevant results, and especially for longer tail terms.
Should we noindex all the bee subpages, so that at least searchers can find SOME page on bees from this site? (I'm assuming that noindexing or nofollowing the 'dup' pages that are not really 'dup' pages at all would nonetheless free the one remaining page on the topic to resurface; perhaps a bad assumption.)
In any case, I refuse. Talk about rigging sites simply for the purpose of ranking. That's exactly what we're NOT supposed to be doing.
G needs to sort this out. ;-)
Note: Posters, please limit comments to the specific issues outlined in this thread. There are a lot of dup issues out there right now. This is just one of them.
As many have said, you've taken down many good content-rich, spam-free, code-clean, and spider-friendly pages. I ask, for what? We follow your TOS, I know I have code that validates, and now you ask ME to provide you with a sitemap. Why? All the other search engines recognize my site and sites like mine and rank them accordingly, and even rather more quickly. I have one page, my index page, which still ranks well, but all my other content is gone and buried so deep there's no bother looking. Do I have duplicate content? NO. Do I use a navigation structure in template form to make surfing easier for my users? Yes. Does anyone else feel a dup content penalty for template pages may have been applied (although all meta tags etc. have been changed, along with H tags as well)?
> If they were indeed pulled because of page similarity then what are people supposed to do?
With commercial sites you can try to make the duplicated information (from upper levels or within a series of variants) part of the page template. Search for an article titled "Feed Duplicate Content Filters Properly" and look into these papers:
Detecting duplicate and near-duplicate files
Detecting query-specific duplicate documents [patft.uspto.gov]
[edited by: SebastianX at 8:07 pm (utc) on Oct. 13, 2005]
I wonder if Google has heard of the Anderer patents.
Spectral analysis anyone?
If you feel you have a duplicate content filter on your site then what do you do to get the filter off?
I mean, after the content is cleaned up with removal, noindex tags, robots.txt, or whatever, do you just wait for Google to spider your site again and wait for the next index update, or do you write an email similar to a re-inclusion request?
Thank goodness total banishment is not the problem but I have been hit hard with some type of penalty. Anyway what do you do once your content is un-duped?
Thank You - Joe
|stuff 4 beauty|
"As many have said, you've taken down many good content-rich, spam-free, code-clean, and spider-friendly pages. I ask, for what? We follow your TOS, I know I have code that validates, and now you ask ME to provide you with a sitemap. Why? All the other search engines recognize my site and sites like mine and rank them accordingly, and even rather more quickly. I have one page, my index page, which still ranks well, but all my other content is gone and buried so deep there's no bother looking. Do I have duplicate content? NO. Do I use a navigation structure in template form to make surfing easier for my users? Yes. Does anyone else feel a dup content penalty for template pages may have been applied (although all meta tags etc. have been changed, along with H tags as well)?"
The same thing happened to me - index page is still showing fine, all other pages buried. I have the same nav structure in the pages, but do have dif H and meta tags on each page....
>I wonder if Google has heard of the Anderer patents.
Probably: ivory-tower types enjoy a good laugh as much as anyone, and Anderer is already a hissed byword in the Linux community.
Hey Coos -
Our problems could also be from our template pages invoking a dupe content filter, though I think templates alone would not be the problem.
It does make sense for G to assume that almost all junk sites will use templates and have many pages, thus they are probably applying tighter standards to template sites with many pages. Those of us who are legit sites are collateral damage, and I hope they are working to fix it.
I favor an interactive manual review process (charge us if they need to) and a volunteer registry of site owners as efforts to reduce the number of good sites that are now filtered to death and facilitate contact with legitimate publishers. The index suffers *greatly* from the omission/downrank of great pages.
I see this in my sector -travel- and also as a heavy online research user in other areas, especially commercial areas.
I'm almost positive now the filter does not apply to a themed site using the same navigation on each page. I've found, at least for my site, the filter was triggered when someone copied part of a page of mine in answer to a question on a forum.
In order to keep my "original content" I could take a snapshot of the page, crop the stolen content, turn it into a photo, and then put it back up while the offending site is using my text.
I haven't done it yet, I want to give G the chance to fix the problem, but I can't stand to suffer forever because of this major problem in the algo. It should surely be easy to compare dates of origin before imposing a penalty.
I have positive evidence that the dupe content penalty can apply to templated pages within one site.
It's all due to the percentage of repeated content.
I run a database-driven advertising site with user-generated details of tens of thousands of individual unique records, which are genuinely different items, quite hard to explain without breaching ToS.
As the individual items are all potential landing pages, until recently I had a lot of explanatory text on these pages, which were generally #1 for their relevant specific key phrases.
About 3 weeks ago all of these pages dropped out of sight - even though my PR7 home page retained its #2 out of several million position on the site's two-word key phrase. (ie it was the pages, not the site, that were being treated differently or "penalised")
They had almost all been reclassified as supplemental pages.
I chopped out a lot of that repeated, templated text as an experiment. The pages are being respidered and reindexed, not as supplemental pages, and there's been a parallel gradual return of position on a selection of the old key phrases.
The site is not yet back to what it was (loss of 75% of traffic, followed by a 10% increase from the low point) but it seems like pretty clear evidence to me. I was amazed, incidentally, how quick the turnaround was once I'd figured what was wrong.
I have an ecommerce gift site that is all hard-coded in HTML (around 1000 pages). I use 2 templates for the product pages and categories. Each product gets a unique description, as I have a full-time copywriter. But once you count the navigation links and other stuff that has to appear on each page, the real content is hardly 10%-15% of the page. Not to mention that for similar products you end up using the same terms no matter what you do. I would love to know how to handle this situation, as I am down in the SERPs (and if it is not because of this, it might pop up on the next update).
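For anyone wanting to put a number on that "real content vs. template" split, here is a rough Python sketch. It is only an illustration: the line-based notion of boilerplate, the `unique_content_ratio` name and the 0.8 cutoff are all assumptions of mine, and nothing here reflects how Google actually measures repeated content.

```python
from collections import Counter


def unique_content_ratio(pages, template_threshold=0.8):
    """Return, per page, the share of words that sit outside boilerplate lines.

    pages: dict mapping a page name to its visible text, one block per line.
    A line is treated as boilerplate when it appears on at least
    template_threshold of all pages (an arbitrary cutoff, for illustration).
    """
    # Count on how many pages each distinct non-empty line appears.
    line_pages = Counter()
    for text in pages.values():
        distinct = {line.strip() for line in text.splitlines() if line.strip()}
        for line in distinct:
            line_pages[line] += 1

    cutoff = template_threshold * len(pages)
    ratios = {}
    for name, text in pages.items():
        total_words = 0
        unique_words = 0
        for raw in text.splitlines():
            line = raw.strip()
            if not line:
                continue
            words = len(line.split())
            total_words += words
            if line_pages[line] < cutoff:
                unique_words += words
        ratios[name] = unique_words / total_words if total_words else 0.0
    return ratios
```

Feed it the extracted text of a sample of pages; pages with a very low ratio are the ones where the template dwarfs the description.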
Dupe content filter is another blow to open source projects that need mirrors . .. ... ....
|I'm almost positive now the filter does not apply to a themed site using the same navigation on each page. I've found, at least for my site, the filter was triggered when someone copied part of a page of mine in answer to a question on a forum |
I've put a noindex tag on all the articles I've written that have been copied in forums or fully on a web page by others. It's kind of pathetic when the author of articles has to noindex them or take them down in these circumstances. But I don't want to take the chance of a dupe penalty like my small site had temporarily during Bourbon. Even though it got corrected, it may not be next time.
"I'm almost positive now the filter does not apply to a themed site using the same navigation on each page. I've found, at least for my site, the filter was triggered when someone copied part of a page of mine in answer to a question on a forum.
In order to keep my "original content" I could take a snapshot of the page, crop the stolen content, turn it into a photo, and then put it back up while the offending site is using my text."
Does this mean there's no point trying to get your 'stolen' content removed from the site where it's copied? I have only recently noticed that content of ours is being reproduced - in big chunks sometimes on blogs etc. Also only recently become aware of dire consequences this could cause!
Any Advice appreciated
Just yesterday I had one page of my content taken down and it was a matter of a simple word to the webmaster with a bit of legal jargon thrown in there, but still have more pages with the same problem. Yes, I could noindex them, and for my rather small site I may be forced to do just that. Not 1,2, or 3 emails to Google have been answered, not even the curt canned response.
Back to themed navigation sites - GG always said build a site for your visitors (and they will come). Well, I did just that and found the opposite after this update. My nav. structure is FOR my visitors - it's a small enough site, and themed rather tightly, so offering my visitors the ability to go from any page to any page just makes sense. Do I use keywords for those links - yes, but again they make sense.
However, Google being nonsensical at this time is just sad....
I built the field, but they have now quit coming. G traffic is off about 60%. Let's see now: is there anything a competitor or anyone can do to affect your site? My answer is 100% YES.
Blog content aggregators by nature are reposting my entire site under their URL and crushing me in the results, ranking way above my site.
I originally thought blogs were good because of the blog indexes. And Google liked blogs.
Now Google hates blogs and blog results are out of favor.
Google has way too much power.
[edited by: sore66 at 4:06 am (utc) on Oct. 29, 2005]
How do you find how much of your e-commerce sites and blogs are copied? Copyscape only checks one web page at a time and I have 100s of pages to check out. Any cool new tools out there for doing the whole contents of the URL in one swipe?
twalton, Of course it's worthwhile to write and ask the webmaster to take down the content they have copied from you. Some people are just clueless and don't know you can't copy anything on the web. Sites that know good and well what they are doing will often take the dup material down because they don't want trouble. After that it gets more involved. Some people threaten legal action, etc. I tend to give up and just put in a noindex tag or change my article quite a bit.
ronin, Here is what I do. I take a phrase from the article that would very unlikely be anywhere else. I put it in " " so I am only searching for the whole phrase. That will usually pull up copies of the article. It's a real pain though so I don't keep up on checking for dup content very well.
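The quoted-phrase trick above can be semi-automated. Below is a minimal Python sketch, assuming that "a phrase very unlikely to be anywhere else" can be crudely approximated by the sentence with the longest average word length. That heuristic and the function names `distinctive_phrase` and `exact_phrase_query` are my own inventions for illustration; picking the phrase by eye will often work better.

```python
import re
from urllib.parse import quote_plus


def distinctive_phrase(article, min_words=5, max_words=8):
    """Pick a candidate phrase: the first max_words words of the sentence
    whose words are longest on average (a crude stand-in for 'unusual')."""
    sentences = re.split(r"(?<=[.!?])\s+", article.strip())
    best, best_score = "", 0.0
    for sentence in sentences:
        words = re.findall(r"[A-Za-z0-9'-]+", sentence)
        if len(words) < min_words:
            continue  # too short to be distinctive
        score = sum(len(w) for w in words) / len(words)
        if score > best_score:
            best, best_score = " ".join(words[:max_words]), score
    return best


def exact_phrase_query(phrase):
    """Wrap the phrase in double quotes so only the whole phrase is matched."""
    return f'"{phrase}"'


def encoded_query(phrase):
    """URL-encode the quoted phrase for use as a search query parameter."""
    return quote_plus(exact_phrase_query(phrase))
```

Run it over each article, then paste the quoted phrase into a search box by hand. It is still one search per page, but the tedious step of choosing the phrase goes away.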
|Blog content aggregators by nature are reposting my entire site under their URL and crushing me in the results, ranking way above my site. |
I originally thought blogs were good because of the blog indexes. And Google liked blogs.
Now Google hates blogs and blog results are out of favor.
Google has way too much power.
Wikis are, I assume, the natural progression of the anybody can edit something philosophy ..
Maybe we need the Google Wiki that would be fun.. ;)
Robot Wars Mark 2
There is a lot of speculation about whether or not having a "templated" right- or left-sided nav bar on every page in a site (purportedly good for usability) is causing an internal cross-linking penalty independently of any other issues which may be filtering your sites under Jagger.
However, you cannot just view the nav bar question as an internal variable alone, since, as many others are reporting, many of my sites, too, are being trounced in the SERPS by thousands of scrapers. On closer inspection I'm seeing that these scrapers actually copied my vertically listed nav items into a text string which Google displays in the SERPS as the description text for those scrapers' pages.
So, as long as nav bars by themselves are not causing in-site cross-linking penalties then a possible fix to throwing off the possible EXTERNALLY generated scraper dup content filter, might be to periodically change the words in your nav lists. Or, if your nav bar is keyword-loaded, you may need to change the order of the words periodically.
>There is a lot of speculation about whether or not having a "templated" right- or left-sided nav bar on every page in a site (purportedly good for usability) is causing an internal cross-linking penalty independently of any other issues which may be filtering your sites under Jagger.
I do not believe that a template nav bar incurs penalties, and I'd hate to see sites destroy good navigation out of simple paranoia.
I built a site with a few thousand pages, over 90% of which were basically a fairly simple database entry wrapped in a navigation-rich template. The size of the database entries -- sometimes as few as a dozen words -- were dwarfed by template material.
In other words, from an autopage-generation point of view, it's not all that different from your typical database-driven catalog.
So I've wondered if it would be depressed in the search results (NOT penalized, as that's an entirely different concept!) by Google's attempts to rank more-unique sites higher than those millions of duplicate drop-ship doorway sites masquerading as independent catalogs. And there may have been times in the past when that was true (I don't worry about it all that much...certainly not enough to CHANGE it or anything) but currently the site seems to be doing as well as it ever has.
annej - thanks for your reply. Do you think it would be worthwhile/a good idea to report sites that don't remove copied content to Google?
Over the last week I have thought about moving sections of my site - ones that might contain 100+ similar template pages, but good content - into a completely new domain using a 301 redirect. With a fresh start it might rank again in a year.
Or is this crazy - I have never tried redirecting to another site before.
longen, templated sites per se are not the issue. You may want to go back and re-read the thread.
The issue occurs when too many pages within a site are deemed too similar. This can easily become a problem with templated sites for obvious reasons. But templates are not the problem. We all know of many templated sites that, with different and unique content on each page, are thriving.
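To get a feel for which of your own pages might be "deemed too similar", one option is a word-shingle comparison. The Python sketch below is an assumption-laden illustration: nobody in this thread knows what Google actually computes, and the shingle size, the 0.7 threshold and the function names are arbitrary choices of mine.

```python
from itertools import combinations


def shingles(text, k=4):
    """Return the set of k-word sequences (shingles) found in the text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: shared items / all items."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similar_pairs(pages, threshold=0.7, k=4):
    """List (page, page, similarity) for every pair at or above threshold.

    pages: dict mapping a page name to its visible text.
    """
    sets = {name: shingles(text, k) for name, text in pages.items()}
    result = []
    for x, y in combinations(sorted(sets), 2):
        score = jaccard(sets[x], sets[y])
        if score >= threshold:
            result.append((x, y, score))
    return result
```

Comparing every pair is fine for a few hundred pages; a site with tens of thousands of records would need sampling or a smarter index.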
|Do you think it would be worthwhile/a good idea to report sites that don't remove copied content to Google? |
I had the impression that Google doesn't remove pages on this basis. You would have to go through a legal copyright process. I think the next step would be to ask the company that hosts the website to remove the offending site. Also you could have a lawyer write a letter to the offending website. I've never gone beyond what I listed earlier. Hopefully someone can give you better information on what to do next.
|On closer inspection I'm seeing that these scrapers actually copied my vertically listed nav items into a text string which Google displays in the SERPS as the description text for those scrapers' pages. |
yes, yes, yes...I can also go to the scraper page indexed by Google and not see a single word or reference to my site on their pages, but they have grabbed all the keywords. I'm going to try your advice on changing nav titles every once in a while to see if this makes a difference.
longen - I have found that when you have approx. 50 template pages, Google is OK with them. When you get over 50 similar template pages, Google flags them as too similar.
1.) Before setting up any redirects to a new site I would try to decrease the number of templated pages. Are your templated pages dynamic, with slight variations of keyphrases or keywords?
2.) Maybe you can also introduce more varied wording within the templated pages.
You may want to set up 301 redirects for orphaned pages. You can also use the Google auto-removal tool to get out page names / URLs that you don't want to see again for at least 6 months. Be very careful using this tool as you can really screw yourself, and if you do you'll probably have no choice but to go to a new domain at that point.
I've found the auto-removal tool as a good resource, but a very dangerous one.
3.) Lastly be very careful about creating new page names, or varying the urls. You may end up inflaming the issue and getting duplicate content into the supplemental index and then you'll be you know where without a paddle.
Most of the pages in question have gone URL only. The problem seems to be the nature of the structure/content:
Every page has 18 headings as follows
<titles> are similar too.
I think that if content is meager then template design can be a tipping point to penalties.
It seems as if your results may be found if you click
"repeat the search with the omitted results included." when you do a site:www.domain.com.
If your results can be found in these pages, then the very first thing that you must do is make sure that all of your pages have different titles in them. Try this first, it just may bring your pages in.
As a goal you want to make sure that all of your pages have unique titles and descriptions, even when you have hundreds or thousands of pages.
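If you have hundreds or thousands of pages, checking titles by eye is hopeless. A small Python sketch for spotting repeats follows. It assumes you already have a mapping of URL to title (from a crawl or straight from your database); the `duplicate_titles` name and the case and whitespace normalisation are my own choices.

```python
from collections import defaultdict


def duplicate_titles(titles_by_url):
    """Group URLs that share the same title, ignoring case and extra spaces.

    Returns a dict of normalised title -> sorted list of URLs, containing
    only titles that are used by more than one URL.
    """
    groups = defaultdict(list)
    for url, title in titles_by_url.items():
        key = " ".join(title.lower().split())
        groups[key].append(url)
    return {title: sorted(urls) for title, urls in groups.items() if len(urls) > 1}
```

The same function works for meta descriptions: pass a URL-to-description mapping instead.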
Thanks, you just switched a light on for me.
Do you think that Google would consider a numeric ID within a title as evidence of uniqueness? I have always excluded unique reference numbers from titles on the basis that this would be diluting the power of the keywords, not particularly relevant to users and aesthetically not very pleasing.
If you think it would make the difference I would get over my objections...
Or would it be better to use some other part of the actual data from the fields in the db to help build a unique title?
I think that a numeric ID is appropriate, the best example may be a product ID number. Or SKU for the product.
This way you are making your titles/descriptions unique while still providing the customer with valuable information.
If you have repeat business and there are customers who buy "big red widget 145 with round edges" every other month, then it may make sense to put 145 into the title...assuming 145 is the product id number.
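As a sketch of what building titles from the database fields might look like in Python: the field names (`name`, `variant`, `sku`), the `build_title` helper and the "Example Widgets" site name are hypothetical, so substitute whatever your own records hold.

```python
def build_title(record, site_name="Example Widgets"):
    """Build a page title from record fields, appending the product ID
    so that otherwise identical names still produce distinct titles."""
    parts = [record["name"]]
    if record.get("variant"):
        parts.append(record["variant"])
    title = " ".join(parts)
    if record.get("sku"):
        title += f" (#{record['sku']})"
    return f"{title} | {site_name}"
```

Putting the ID at the end keeps the keywords up front, which goes some way toward the dilution and aesthetics worries raised above.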
Most of the <titles> are 99% similar - for new sections I'm developing I will use "full" titles at Menu level to draw traffic, but make Page titles unique to avoid duplicated data.
This isn't as informative for users, but avoids problems.