| This 154 message thread spans 6 pages: < < 154 ( 1 2  4 5 6 ) > > || |
|Duplicate Content Observation|
Some sites are losing ALL of their relevant pages
| 7:05 pm on Sep 29, 2005 (gmt 0)|
We've just done a whole bunch of analysis on the dup issues with G, and I wish to post an observation about just one aspect of the current problems:
The fact that even within a single site, when pages are deemed too similar, G is not throwing out the dups - they're throwing out ALL the similar pages.
The result of this miscalculation is that high quality pages from leading/authoritative sites, some that also act as hubs, are lost in the SERP's. In most cases, these pages are not actually penalized or pushed into the Supplemental index. They are simply dampened so badly that they no longer appear anywhere in the SERP's.
The current problem is actually not new IMHO. It began surfacing on or about Dec 15 or 16 of last year. At that time, the best page for the query simply seemed to take a 5-10 spot drop in the SERP's...enough to kill most traffic to the page, but at least the page was still in the SERP's. If there were previously indented listings, those were dropped way down.
From early Feb through about mid March, the situation was corrected and the best pages for specific queries were again elevated to higher rankings. When indented listings were involved however, the indented listing seemed now to be less relevant than was the case pre-Dec.
In mid March to about mid May, the situation worsened again, approximately to the problems witnessed in mid Dec., i.e., the most relevant pages dropped 5-10 spots, indents vanished as was the case in Dec.
But the most serious aspect of the problem began in mid May, when G started dropping even the best page for the query out of the visible SERP's.
A few days ago, the problem worsened, going deeper into the ranks of high quality, authoritative sites. This added fuel to what has become the longest non-update thread [webmasterworld.com] I've ever seen.
Why This is Such a Problem
The short answer is, that a lot of very useful, relevant pages, are now not being featured. I'm not talking about just downgraded. They're nowhere.
Now, I'm sure that there are sites that deserved the loss of these vanished pages. But there are plenty of others whose absence is simply hurting the SERP's. There is a difference between indexing the world's information, and making it available after all.
|Hypothetical Example |
We help a client with a scientific site about insects (not really, but the example is highly analogous). Let's discuss this hypothetical site's hypothetical section about bees. Bees are after all very useful little creatures. :-)
There are many types of bees. And then there are regional differences in those types of bees, and different kinds of bees within each type and regional variation (worker, queen, etc). Now, if you research bees, and want to search on a certain type of bee - and in particular a worker bee from the species that does its work in a certain region of the world, then you'd like to find the page on that specific bee.
Well, you used to be able to find that page, near the top of the SERP's, when searching for it.
Then in mid Dec, you could find it, but only somewhere in the lower part of the top 20 results.
Now, G is not showing any pages on bees from that site. Ergghh.
What is an Affected Site To Do?
One option, presumably, would be to stop allowing the robots to index the lesser pages that are 'causing' the SE's to drop ALL the related pages. But this is a disservice to the user, especially in an era when GG has gone on record as taking pride in delivering especially relevant results, and especially for longer tail terms.
Should we noindex all the bee subpages, so that at least searchers can find SOME page on bees from this site? (I'm assuming that noindexing or nofollowing the 'dup' pages that are not really 'dup' pages at all would nonetheless free the one remaining page on the topic to resurface; perhaps a bad assumption.)
In any case, I refuse. Talk about rigging sites simply for the purpose of ranking. That's exactly what we're NOT supposed to be doing.
G needs to sort this out. ;-)
Note: Posters, please limit comments to the specific issues outlined in this thread. There are a lot of dup issues out there right now. This is just one of them.
| 9:46 pm on Oct 3, 2005 (gmt 0)|
"That said, nothing in your post above feels necessarily wrong to me. In fact, it may be that what I'm theorizing is a part of a new algo that you are brave enough to wonder aloud about...
The one aspect of the new algo post I disagree with is the conclusion that "beta issues than mature algo tweaks." I believe that that statement is ONE potential view of the problems we're seeing."
Yes, I agree, it's entirely possible that what we're seeing are not beta issues, but an algo that is simply much tighter, with many more switches and configuration options, all of which can be tested more easily.
For example, in the case of ALL pages disappearing, that could simply be flipping the 'all pages disappear' switch, see what happens, watch WebmasterWorld feedback, watch search data, click throughs etc.
If I write a new app, I put everything I've learned in the past into it, and I try to include the greatest possible range of options and switches into it so I don't have to redo it. This would be based on previous experience.
I'd drop it to the very simplest level: how long can you add features and tweaks to a huge app before you need to do a full rewrite? If you look at windows for example, it was 3.x to 9x, then NT to XP. Roughly 7 years between major rewrites that is. Especially when the fundamental architecture for that app was developed with almost no real world experience by two guys in their early twenties. Twenty something programmers always make the same type of mistakes, but at some point those mistakes become unsustainable and the thing has to be redone. Look at Netscape 1-4x versus Firefox for example.
So rather than ask if they have, I'd ask, how could they not do this if long term growth and survival is in question? I'm seeing far too many new switches being turned off and on this year to not see a new machine behind the curtain. And the switches are working very well.
If I look at for example the dupe thing, in the past there was from what I can see basically just one big switch, now I see more subtle gradations, multiple switches that allow fine tuned control. Which is exactly what I would do if I were creating a new app based on past experience.
You've always pushed a very empirical methodology, which I always learn a lot from, but I can't resist seeing if I can make a big picture that can take as much as possible into account...
[edited by: 2by4 at 10:03 pm (utc) on Oct. 3, 2005]
| 9:47 pm on Oct 3, 2005 (gmt 0)|
This was taken from someone at Matt Cutts:
We use titles and descriptions in our sub sections to introduce contents of our articles which is the same as the title and description on the top of our articles and related articles as well as the meta title and description.
Our site and a hand full of other sites with similar structures all have had problems. This is my best guess of what is causing the problem. I am just guessing that Google is finding the duplicated internal text/anchor text and deciding which one is good. In most cases, it is picking low ranking sub-section pages and filtering out the real articles.
| 9:55 pm on Oct 3, 2005 (gmt 0)|
Well the mathematician in me says that all other things being equal (may not be, we don't know how far along this non-update update is [I think about a month+, but that is just me]) if you find a counter example that rebuts the theory, you must abandon or modify the theory.
I can provide a theory, but it is outside the scope of this thread.
| 10:01 pm on Oct 3, 2005 (gmt 0)|
>I'd drop it to the very simplest level: how long can
>you add features and tweaks to a huge app before
>you need to do a full rewrite?
That's correct - but remember that Google doesn't have to go either/or, it can run both algos at the same time, and just weigh one or the other stronger. Similarly, it has the possibility to play with the settings on the cached set of sites for internal tests. I'm certain they do that before any small/medium/large changes: run a test group through Google and let them tell the engineers which setting produces better results. ("taste-testing" for search engines) And they certainly run the betas out for testing on a large scale (check the number of posts on WH / BH sites regarding a change, the higher the BH:WH-ratio goes, the better the changes?).
So perhaps what we're now seeing is a small scale-up from one algo to the new one.
One thing that certainly is a problem is that Google has such clout that the websites are largely at its mercy. If it decides to drop affiliate sites completely (say it could), then all those that depend on those sales will suddenly have a big drop in sales, because Google "feels like it would be better". Ouch. In the worst case, lives down the drain because of taste-tests. I guess that's life for a web-marketeer :-).
Has anyone tried to model a new SE based on Google's serps? It would be an interesting experiment (and a ton of hard work)...
| 10:05 pm on Oct 3, 2005 (gmt 0)|
wiseapple, yes, internal site dup filters are nothing new...and they can relate to all sorts of things. I noted the possibility in this thread [webmasterworld.com] some time ago that templated sites may face problems ... and at the time, took a bit of grief for the notion. Don't take grief for that comment any more. ;p
2by4, right now, nothing in your theory falls apart AFAIK. Possible exception: Brett's ear is close to the ground and neither he nor G has suggested that what happened on or about 9/22 was an update.
Then again, the definition of 'updates' seems to grow cloudier over time, no? Well, for me anyway. ;-)
| 10:10 pm on Oct 3, 2005 (gmt 0)|
"Brett's ear is close to the ground and neither he nor G has suggested that what happened on or about 9/22 was an update."
Don't forget last november or thereabouts, that same ear said: look for very large changes.
As the poster above said, it's quite possible they've been switching back and forth, doing testing etc.
I don't want to get into one of the main points that suggest conclusively that this switch happened thereabouts because I don't want the thread to go off-topic, and the arguments that question raises always tend to degenerate into shouting matches, but long time readers here will know what it is.
<added>If I had google's resources, and were rewriting this app, I'd include the option to essentially run the old system in the new one, like windows supporting 16-bit DOS apps, and have a switch so I could switch a data center from one to the other easily.
[edited by: 2by4 at 10:17 pm (utc) on Oct. 3, 2005]
| 10:10 pm on Oct 3, 2005 (gmt 0)|
Miop, without going into detail, if you look at 2 of your URLs as listed on Google, one fully indexed, one not, you must admit the contents looks pretty similar:
Would you not count them as duplicates? Without looking at the rest of your site (PM me if you're interested), I would venture a guess that this is what bit you. Not 100% duplicates, but pretty close - it would fit into my theory of block-level duplicate detection :-).
I'm seeing a lot of this lately, that and combined with Google indexing lots of session-ids lately, can easily spell the death to sites that don't really contain duplicates but could have multiple URLs pointing to the same content. And yes, I've seen this hit lots of forums, specifically because of the multiple versions of the pages all being indexed... :-( Perhaps sites of a certain "value" are immune to this game? (large forum sites, Amazon, etc.)
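To make the "multiple URLs pointing to the same content" point concrete, here is a minimal Python sketch of how session-id variants of one page collapse to a single address once the session parameter is ignored. It is only an illustration of the problem, not a description of how Google handles it. The parameter names in SESSION_PARAMS and the example.com URLs are assumptions; check what your own forum or cart software actually appends.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical session parameter names; adjust for your own software.
SESSION_PARAMS = {"sid", "phpsessid", "sessionid"}


def canonical_url(url: str) -> str:
    """Return the URL with session-style query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def group_by_canonical(urls):
    """Group URLs that differ only by session parameters."""
    groups = {}
    for url in urls:
        groups.setdefault(canonical_url(url), []).append(url)
    return groups


if __name__ == "__main__":
    crawled = [
        "http://www.example.com/forum/viewtopic.php?t=12&sid=abc123",
        "http://www.example.com/forum/viewtopic.php?t=12&sid=def456",
        "http://www.example.com/forum/viewtopic.php?t=12",
    ]
    for canonical, variants in group_by_canonical(crawled).items():
        print(canonical, "<-", len(variants), "variant(s)")
```

Running something like this over your own server logs or a site: listing shows how many indexed addresses really are one page.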
| 10:20 pm on Oct 3, 2005 (gmt 0)|
[ Miop, without going into detail, if you look at 2 of your URLs as listed on Google, one fully indexed, one not, you must admit the contents looks pretty similar:
Would you not count them as duplicates? Without looking at the rest of your site (PM me if you're interested), I would venture a guess that this is what bit you. Not 100% duplicates, but pretty close - it would fit into my theory of block-level duplicate detection :-). ]
That is actually just one product page - the product number is 1498, section number 247 or 257 as the product is linked to from two different sections. I hadn't thought that listing a product in two sections could cause the product page to be considered a dupe, because there is only one product page, but then I read elsewhere that a site may be spidered by more than one googlebot at the same time, making it appear as if it is two identical pages.
The product is listed in two sections to make it easier for the customers to find depending on what they are looking for - another thing I'm going to have to dismantle to please the SE. :(
For other items, it is simply that they are so similar to the item descriptions for other items e.g. black pvd widget/gold pvd widget, or 1.2mm gauge v 1.6mm gauge that the pages are 90% similar and due to the template system and product similarity I can't seem to get it any lower. Having said that, it was not a problem until July - most product pages were listed back then, and it's not a problem in other SE's.
I think I'm going to have to go right back to basics and start again...
Thanks for looking! :)
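For anyone who wants to put a number on "90% similar", here is a rough Python sketch that compares two pieces of page text by overlapping three-word runs (word shingles) and reports the Jaccard overlap. This is a standard textbook measure, swapped in purely for illustration; nobody outside Google knows which measure or threshold the dup filter uses. The function names and the sample descriptions are made up.

```python
import re


def shingles(text: str, size: int = 3) -> set:
    """Return the set of overlapping word runs of the given size."""
    words = re.findall(r"[a-z0-9.]+", text.lower())
    if len(words) < size:
        # Text shorter than one shingle: treat the whole text as one unit.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def similarity(a: str, b: str, size: int = 3) -> float:
    """Jaccard similarity of the two shingle sets, from 0.0 to 1.0."""
    set_a, set_b = shingles(a, size), shingles(b, size)
    if not set_a and not set_b:
        return 1.0  # two empty texts are trivially identical
    return len(set_a & set_b) / len(set_a | set_b)


if __name__ == "__main__":
    black = "black pvd widget 1.2mm gauge with free delivery and gift box"
    gold = "gold pvd widget 1.2mm gauge with free delivery and gift box"
    print(f"similarity: {similarity(black, gold):.0%}")
```

Feeding in the full visible text of two product pages (template included) gives a feel for how much of the overlap comes from the template and how much from the descriptions themselves.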
| 10:40 pm on Oct 3, 2005 (gmt 0)|
[I hadn't thought that listing a product in two sections could cause the product page to be considered a dupe, because there is only one product page, but then I read elsewhere that a site may be spidered by more than one googlebot at the same time, making it appear as if it is two identical pages.]
In fact because of the way the templates system works, it *is* two identical products even though the product only exists on one product page - penny just dropped.
Many thanks for your assistance.
| 10:41 pm on Oct 3, 2005 (gmt 0)|
with time it looks like only PRO webmasters will be able to create real websites, because what has all happen on google from 301 to filters.
| 10:50 pm on Oct 3, 2005 (gmt 0)|
[ with time it looks like only PRO webmasters will be able to create real websites, because what has all happen on google from 301 to filters. ]
Give it a couple more years and we'll all be PRO webmasters. :)
Are there any college courses on Google SEO yet? I bet it would be *packed*.
| 11:28 pm on Oct 3, 2005 (gmt 0)|
ok, let me take my devil's advocate hat off and step back behind caveman with some research i started, triggered by this thread:
for "friends and family" i have around 20 gallery sites up and in dmoz. mostly personal sites, birthday and wedding related.
i just looked at them with the "site:" command. almost all had to move to another server about 5 weeks ago.
all have PR3-4
all show "Supplemental Result"
I guess this could be either another bug in the dupe content filter, or Google is not happy at all that I have moved the domains to a new server, even if the nameserver did not change!
So, back to the initial posting of caveman: I believe as well, that the filter is too harsh, too crazy to provide a good index.
If you run the open source GALLERY program: check your domains, if they show dupe filtering! I would be interested in your observations.
All these GALLERY pages are the same, besides the picture name, right?
| 12:08 am on Oct 4, 2005 (gmt 0)|
I recently did some experiments with gallery stuff.
First question, are you using query string ? urls, or are you using mod_rewrite in the url?
A few months ago I experimented on this very thing, and instantly realized that the query string gallery pages did not work, all supplemental, but mod_rewritten stuff is correctly handled as unique pages.
Easy to see why with galleries, just a bunch of image links, most text is the same except a blurb or two.
| 12:26 am on Oct 4, 2005 (gmt 0)|
uhmmm, no mod_rewrite on these... that might be the problem, but the albums are fetched constantly and correctly by the bot. It is very weird that the bot treats URLs differently when they just "look" different, since it can swallow both.
I will put a .htaccess there and see what happens anyhow...
| 12:54 am on Oct 4, 2005 (gmt 0)|
All I can tell you is that when I switched to mod_rewrite and dumped the ? urls for standard looking urls, the pages:
1. Got PR
2. Started ranking quite well for their gallery topic areas
3. Started bringing in fresh traffic to the site, in other words, users looking for things the galleries are about are now finding it.
Sounds like win/win/win to me...
However, to be clear, this may or may not be related to the initial dupe content matter caveman raised, although the time frame is roughly similar, around march I think is when I switched them.
At that point, these simple single variable query string urls were most definitely not ranking in any way, and were not being treated as unique pages, which is when I started ignoring completely what search engines say about query string urls, and also when I completely stopped using them.
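For readers who haven't done this before, the switch amounts to a two-way mapping between the query string address and a standard looking one. On Apache the rewriting itself is done by mod_rewrite rules in .htaccess; the Python below is only a sketch of the mapping, under assumed names (a script called gallery.php with a single album variable, and a /gallery/name/ path layout), none of which come from the posts above.

```python
import re
from urllib.parse import urlsplit, parse_qs, quote, unquote

# Hypothetical pretty-URL layout: /gallery/<album>/
PRETTY = re.compile(r"^/gallery/([^/]+)/$")


def to_pretty(url: str) -> str:
    """Map /gallery.php?album=NAME to the path-style /gallery/NAME/."""
    album = parse_qs(urlsplit(url).query).get("album", [""])[0]
    if not album:
        raise ValueError("no album parameter in URL")
    return "/gallery/" + quote(album, safe="") + "/"


def to_internal(path: str):
    """Map a path-style URL back to the script that serves it, or None."""
    match = PRETTY.match(path)
    if match is None:
        return None
    return "/gallery.php?album=" + quote(unquote(match.group(1)), safe="")


if __name__ == "__main__":
    pretty = to_pretty("/gallery.php?album=wedding")
    print(pretty, "->", to_internal(pretty))
```

The links on the site use the output of to_pretty, the server quietly does what to_internal does, and the old ? addresses should be redirected rather than left live so both forms don't get indexed.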
| 1:03 am on Oct 4, 2005 (gmt 0)|
It's very interesting, but IMO no, it's not related to the main issue of the thread.
Perhaps worthy of a new thread 2by4? ;-)
| 1:07 am on Oct 4, 2005 (gmt 0)|
Nah, no new thread is needed, I don't actually have any questions about it at all since it's so easy to see that it worked perfectly, the site is ranking great for everything I throw into it.
I think it just sort of comes under the general observation I made earlier that what I'm seeing increasingly is that google's tolerance for what I think is correctly called 'sloppy webmastering' has dropped dramatically.
I need to look more into the duplicate issues you raised though, I'm still not completely clear on what the trigger parameters are.
But let's look at your initial point again:
"G is not throwing out the dups - they're throwing out ALL the similar pages."
This is precisely what I was seeing, all the similar pages were being thrown out, since it was a templated format.
| 1:14 am on Oct 4, 2005 (gmt 0)|
> I'm still not completely clear on what the trigger parameters are
Hehe. Join the club.
Or, as cavegramps used to say, "Get used to it."
| 1:25 am on Oct 4, 2005 (gmt 0)|
we'll just have to keep watching...
"It began surfacing on or about Dec 15 or 16 of last year"
This is the precise time frame I was looking at, I believe this is when the new algo started coming online. Another poster here confirmed that something he's noted also began in december. I don't think any of this is coincidence, not at all.
Spinoza said that belief in chance simply reflects an inadequate understanding of the circumstances.
| 1:28 am on Oct 4, 2005 (gmt 0)|
Since caveman narrowly defined the thread, I haven't seen anything that fits the criteria, but I'm wondering about this "no pages rank" thing.
Suppose you use the Google SiteSearch thingee. If you swipe in a bit of text, this should normally return results from the single domain, ranked as if the single domain was the entire Internet. Are people suggesting no results would appear for their affected pages?
Going off-topic a bit to explain... when I do this for my recently "lost" domain, the first result is a Supplemental listing deleted more than a year ago. The second listing is any other Supplemental that mentions the word(s). After that, the results are normal.
Most notably there is a technical word in my niche that has no meaning anywhere else. This word is on fifty+ pages on the site. When searching via the SiteSearch, the long-gone Supplemental appears first, even though this technical word only appears once in the body of the page... meaning that one mention outranks my detailed page devoted to that technical word with a URL like example.com/technicalword/ (where technicalword is the technical word, not two words scrunched together) despite almost all the other instances of the word appearing on the site being links to the detailed page.
I'm wondering if people are experiencing something similar, where one Supplemental result can kill every ranking for anything where the words appear anywhere on the Supplemental page... but this would not be the case as described in the thread if no pages show up for a search.
| 1:48 am on Oct 4, 2005 (gmt 0)|
|The current problem is actually not new IMHO. It began surfacing on or about Dec 15 or 16 of last year. At that time, the best page for the query simply seemed to take a 5-10 spot drop in the SERP's...enough to kill most traffic to the page, but at least the page was still in the SERP's. If there were previously indented listings, those were dropped way down. |
A few days ago, the problem worsened, going deeper into the ranks of high quality, authoritative sites. This added fuel to what has become the longest non-update thread I've ever seen.
By a not strange to me coincidence, this is precisely the issue we're having right now on at least one site. To the letter. One of our sites, which I think could fairly be declared at least a hub site, just experienced this on a single search term.
It sounds to me like once certain groups of pages are being determined dupe then dropped, those pages may drag down the remaining pages slightly for that specific search term, without any visible penalty being applied. I've been seeing more and more signs of google treating sites like sites, not unique urls, for a while now.
Keep in mind that this behavior STRONGLY indicates a brand new dupe component, since the old one, as Brett has always pointed out, merely took the dupe page, tried to determine the original page, then dropped the dupe out of the serps. The site was not affected.
What caveman is looking at here, and I believe what I'm looking at on at least one site, is a brand new way to handle duplicate content, completely different, and significantly more... let's say: interesting, although of course many would happily not experience it, for obvious reasons.
| 2:47 am on Oct 4, 2005 (gmt 0)|
I dunno if this helps, but my other site with plenty similar product issues has barely suffered from the dupe content thing - it's an OS commerce site, so a product that is repeated in more than one section still only (really!) has one product page. At least for my site, it looks like the template system is the main problem (mixture of php and html). Hope that helps somebody!
| 3:45 am on Oct 4, 2005 (gmt 0)|
We are having an internal duplicate content issue as well, it stems from the middle of December time frame as mentioned by some others.
Our problem is unique though in that because we are hosted on a windows platform, urls are not automatically converted to lowercase letters. So urls are apparently being considered unique if they have all lowercase and another version with some letters in uppercase. Hence two unique urls with the same content tripping a duplicate content penalty and bam, there goes the rankings.
The cause of the problem is ours and not a deliberate one, but fixing it has become a whole different matter. One where the solution isn't clear without some advice from Google on what we can do to get the urls with capitals removed from their index. The problem is, the all lowercase url and the one with uppercase letters in it are ultimately coming from the same file. Nuking one with the removal tool might also cause us to be nuking the correct one. I e-mailed Google about 10 days ago informing them of the problem we discovered and asking how best to get the incorrect urls out of their index without removing the correct ones with all lowercase letters in the url. I still have not gotten an answer.
Unless we get an answer I'm afraid we will have to at least take a shot at trying something on our end and hope it works. If it goes wrong though, I guess we will have to be prepared to feel like we are in Siberia for 6 months to a year. It's their index, I just wish they would give us some guidance so that the problem can be resolved in a professional manner.
| 4:00 am on Oct 4, 2005 (gmt 0)|
We had exactly the same problems, site on IIS, couldn't implement mod_rewrite 301s etc, we moved it to Apache, reprogrammed the asp stuff, implemented all the necessary 301s and the site came back quite quickly. Obviously, we also made all the urls correct etc before moving the site. This resolved that particular issue, but it's possible we're still seeing some side effects of that problem currently, it's hard to say, but I can't help wondering if some of those old pages are still causing us problems.
| 4:14 am on Oct 4, 2005 (gmt 0)|
Yes, my thoughts are that unless we can get them removed from the index via the removal tool, they are going to continue to be a problem at least for the foreseeable future.
We are looking at an option that would at least allow us to present a 301 from the urls with uppercase letters in them, but that might not do enough because this internal duplicate content problem doesn't seem to matter even if one page is in the main index and another is in the supplemental results index, you still are being penalized for both pages.
| 4:46 am on Oct 4, 2005 (gmt 0)|
Create temporary pages of the pages you want removed, make the pages have only this content:
<% ' Classic ASP (VBScript): send a permanent redirect to the correct address
Response.Status = "301 Moved Permanently"
Response.AddHeader "Location", "http://www.yoursite.com/page-you-want-bad-address-to-redirect-to.asp"
%>
Leave it up for a long time, forget about them. This page, again, will have the file name and folder location of the page you want removed.
This should resolve at least that particular issue.
Of course make sure that all your real links on the site correctly reflect the actual file name you want indexed.
| 4:58 am on Oct 4, 2005 (gmt 0)|
steveb, what you refer to, as you surmise, is a different issue. We can pull up the lost pages using the G Site Search thingy, and the page most relevant to the specific term searched on comes up first when using that tool, often with no Supp results in sight. We just don't see the relevant pages in the real SERP's.
Also for steveb: Not wanting to stray more OT than some of this thread has gone, to what do you attribute the issue you're noting? Basically external dup content? I ask because it might help people distinguish between the two similar, but quite different, problems. (I know the issue to which you refer has been discussed a lot in the non-update thread.)
| 5:10 am on Oct 4, 2005 (gmt 0)|
Some domain-wide threshold of external copies.
The existence of Supplementals for your domain.
The fact that Supplementals are *older* than current copies on your domain... basically by definition any page you have moved in the past will be older (meaning indexed before) than your current location. ---- Another way to think about this is previously Google has stated pagerank was a deciding factor in choosing a canonical page... in this case that is 100% not the case; instead, the choice of the canonical page is made strictly based on age.
| 5:34 am on Oct 4, 2005 (gmt 0)|
>> the choice of the canonical page is made strictly based on age.
I probably agree with this about 80%. Not knowing what has happened to your sites, I can however relate to what happened with my site.
1) Site doing very well -- well aged and well linked to.
2) Site gets scraped.
3) Site has a bad robots.txt on it so ranked pages go supplemental
4) Scraper pages remain
5) Robots.txt replaced
6) Old pages refuse to rank
7) Change the directory of the old pages. Old pages return a 410 to make sure the proxy type scrapers do not get the content on the same URL.
8) New URLs are back to the old ranks
This indicates one of two things. Either the penalty on my site was not sitewide, but is applied in a tree-like fashion, with the penalty going up the tree as dupe clusters are determined. If the dupe clusters go all the way to the root node, the entire site gets hit.
Or this filter was applied as a one-time process, I got my rankings back by accident, and they will be hit again the next time the process is run. If this is the case, I'm screwed.... given that this was one of my "let's do this #FFFF88 shade of gray" sites.
If there is someone out there who has just had one directory that has been hit .. have you tried moving the content?
| 11:03 am on Oct 4, 2005 (gmt 0)|
All the previous versions of my website still sit below each other on the server.
(I copied/overwrote files, but never deleted any. In each new version, I just removed the links to deprecated pages.)
Is that OK, and if not, what would you advise I do?
| 11:19 am on Oct 4, 2005 (gmt 0)|
one more thing you could do is to check the correct url within your code.
In some kind of pseudocode:
if (thisPageURL <> lowercase(thisPageURL)) then redirect301(lowercase(thisPageURL))
I did something like that for my pages and it works fine. (My pages' URLs are created using the headline. So if the headline changes, the URL will change too. To avoid dupe content I redirect the old URL to the new one.)
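A runnable version of that check, as a minimal Python sketch. The function name and the example.com addresses are made up, and how you actually send the 301 depends on your server and language (on classic ASP it would be the Response.Status / Response.AddHeader pair). Only the path is lowercased here; the query string is left alone on the assumption that parameter values may be case sensitive.

```python
from urllib.parse import urlsplit, urlunsplit


def lowercase_redirect(url: str):
    """Return (301, lowercase_url) if the path contains uppercase letters, else None."""
    parts = urlsplit(url)
    lower_path = parts.path.lower()
    if parts.path == lower_path:
        return None  # already the canonical form, serve the page normally
    return 301, urlunsplit(parts._replace(path=lower_path))


if __name__ == "__main__":
    for requested in (
        "http://www.example.com/Products/Widget.asp?id=7",
        "http://www.example.com/products/widget.asp?id=7",
    ):
        print(requested, "->", lowercase_redirect(requested))
```

The check has to run before any page output, and every internal link should already point at the lowercase form so the redirect only ever catches old or external links.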