Forum Moderators: Robert Charlton & goodroi
I have a forum site that has existed for about 18 months.
About 14 months ago I changed forum software from SMF to vBulletin. The SERPs on the first 4 pages of Google, when using the site: command, are full of non-existent pages from the old forum. I've checked and all of these return a 404 header.
From about page 5 onwards, the new vBulletin forum has several thousand pages listed as supplemental - about 20 of which go into the main index for a few days every month - before reverting to supplemental.
The only pages that are listed consistently in the main index are the home page and two general information pages.
The site doesn't have a huge amount of incoming links, but it does have some reasonable ones - enough to get any other site I've worked on indexed properly.
Yahoo and MSN have a full and current index of the site, and it performs fairly well for most relevant search terms.
Any suggestions would be gratefully received.
That is what Supplemental Results are. They are older content, removed content and duplicate content.
There is no way to control this action. You can safely ignore them.
What you should be measuring is how many pages of the current site are fully indexed. THAT is what is important.
Check out all that I wrote in this other thread: [webmasterworld.com...] and the threads that it mentions.
Over 95% of the content of the site has been added since I converted the forum to the new software. Some of the threads were converted from the old forum, but less than 5% remain.
It's the 'fully indexed' part which I am concerned about. Most of the new pages/threads are being added to Google's index, but they are all going into the supplemental index along with the 404 pages.
When I do a search for content from those pages in "quotes", the pages are not returned in the SERPs, although they do show up in Google's cache when I check them after a site: command.
As I said, roughly once a month about 30 pages pop into the main Google index, but they only remain there for a day or so before becoming supplemental again.
That is what can generate a duplicate content problem, and very likely is what is sending your urls into the Supplemental Index. At the very least, it can be a part of the problem.
Here's one thread where g1smd really tears into the issue:
[webmasterworld.com...]
I assume that also applies to 410 GONE?
It shouldn't. A 410 means the resource is GONE, kaput, patooey, ain't there no more. Once the bot receives that response, they should not return to reindex that URI.
A 404 means that the resource was not found. Most spiders will keep requesting a 404 URL numerous times and, at some point, will remove it from their index. In Google's case, it takes quite a bit of time (too much time from my perspective).
Pages removed should return a 410. Pages moved should return a 301 (or 302 in appropriate instances).
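In practical terms, that 301 / 410 / 404 split looks something like the following minimal WSGI sketch in Python. The removed and moved URL lists are hypothetical stand-ins for however a real forum (or the server configuration) would actually make the decision:

```python
# Minimal sketch of the 301 / 410 / 404 split described above.
# REMOVED and MOVED are hypothetical examples; a real forum would look
# these up in its database, or handle them in the server configuration.

REMOVED = {"/forum/index.php?topic=123.0"}                          # deleted for good -> 410
MOVED = {"/forum/index.php?topic=456.0": "/showthread.php?t=456"}   # relocated -> 301

def app(environ, start_response):
    path = environ.get("PATH_INFO", "")
    query = environ.get("QUERY_STRING", "")
    url = path + ("?" + query if query else "")

    if url in MOVED:
        # Permanent move: send the bot (and the link credit) to the new address.
        start_response("301 Moved Permanently", [("Location", MOVED[url])])
        return [b""]

    if url in REMOVED:
        # Deliberately removed: 410 tells well-behaved bots to stop retrying.
        start_response("410 Gone", [("Content-Type", "text/plain")])
        return [b"This page has been permanently removed."]

    # Anything else that doesn't resolve gets a plain 404 Not Found.
    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"Not found."]

if __name__ == "__main__":
    # Quick local test server.
    from wsgiref.simple_server import make_server
    make_server("", 8000, app).serve_forever()
```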
That's what I thought, too, but I have numerous urls still showing up in Sitemaps after a year+ and they are all returning 410 GONE, and even showing that way in Sitemaps.
They have been spidered numerous times throughout the last year, and many were spidered as recently as yesterday.
Very strange.
I may be wrong on this, and I'm off to find out this moment, but I believe Matt Cutts stated at one time not too long ago that they handled 404 and 410 the same way. Hold on...
Yup, here's a related discussion...
Returning 410 instead of 404
[webmasterworld.com...]
Best practice is to return the proper response. jdMorgan gives an excellent summary in the above topic of how this should be handled.
Imagine your pages aren't being indexed. You ask on here, and no-one can see any reason why they shouldn't be. You say that search engines never ask for that page, even though it has dozens of inbound links.
Then Matt Cutts posts and says, "Oh yes, just over 20 years ago that page returned a '410 Gone' response, so we will never access that file ever again."
You would think that daft. For something that is "Gone" of course Google needs to check the status from time to time just in case it is ever not-gone.
There is a world of difference between what Google checks the status of to see if it exists, and what they show in their index as actually existing.
In the same way, they do cache and keep a copy of pages that you tag as "meta robots noindex" but they do not show them in the public index.
They do that because they need to keep a record of the status of the page. If they didn't keep a record of a page being "noindex", they would come back every day to check its status as if they had never accessed the page before. You would call that "broken".
The reason I asked about 410 GONE...
Last year I removed 4,500 pages using the removal console and set up a 410 GONE response for those pages.
After 14 months, 38 of the pages are still listed in Sitemaps as "Errors."
I always go on the assumption that Google has a reason for what they do (no cat-calls, please), so I'm just curious why the remaining 38 haven't been dropped in case there's something different about them that I don't realize.
The pages are definitely gone and as far as I can tell, there are no external links pointing to them.
It's not a big deal, but I was just curious.
The reason to understand duplicate content issues is that vBulletin, in its default configuration, has a design flaw that allows every thread and every post to have at least 10 different URLs that all serve the exact same content.
Sure, most forums are a bit like that. And I've taken the usual robots.txt steps to deal with the worst offenders.
But even where duplicate content is present, Google is fairly good about choosing one URL to index whilst assigning the duplicates to the supplemental index. And of course there are tens of thousands of extensively indexed vBulletin forums around the world.
The problem here is that not a single thread from the forum - not even the main boards - is in the main Google index.
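To make the duplicate-URL point concrete: beyond robots.txt, the usual approach is to collapse every variant of a thread URL down to one canonical address and 301 (or rel="canonical") everything else to it. Here is a rough Python sketch of that normalisation - the parameter names are only illustrative, not an exact vBulletin recipe:

```python
# Rough sketch: normalise duplicate thread URLs to a single canonical form.
# The parameter names below are illustrative, not a drop-in vBulletin fix.
from urllib.parse import urlsplit, parse_qs, urlencode, urlunsplit

KEEP = ("t", "page")   # thread id and page number define the canonical URL

def canonical_thread_url(url: str) -> str:
    parts = urlsplit(url)
    params = parse_qs(parts.query)
    kept = {k: params[k][0] for k in KEEP if k in params}
    # Drop session ids, highlight terms, sort options, fragments, and so on.
    return urlunsplit((parts.scheme, parts.netloc, "/showthread.php",
                       urlencode(kept), ""))

# Every variant of the same thread collapses to the same address, which is
# what you would 301 the others to:
print(canonical_thread_url("http://example.com/showthread.php?t=123&highlight=foo"))
print(canonical_thread_url("http://example.com/showthread.php?t=123&page=2&s=abc123def"))
```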