Forum Moderators: Robert Charlton & goodroi


Google lists dozens of 404 files for site for a year.

Have been 404 for over 12 months - new pages supplemental.


bouncybunny

6:30 am on Sep 12, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hi

I have a forum site that has existed for about 18 months.

About 14 months ago I changed forum software from SMF to vBulletin. The SERPs on the first 4 pages of Google, when using the site: command, are full of non-existent pages from the old forum. I've checked and all of these return a 404 header.

From about page 5 onwards, the new vBulletin forum has several thousand pages listed as supplemental - about 20 of which go into the main index for a few days every month - before reverting to supplemental.

The only pages that are listed consistently in the main index are the home page and two general information pages.

The site doesn't have a huge number of incoming links, but it does have some reasonable ones - enough to get any other site that I've worked on indexed properly.

Yahoo and MSN have a full and current index of the site and it performs fairly well for most relevant search terms.

Any suggestions would be gratefully received.

g1smd

12:02 pm on Sep 12, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The Supplemental Results for pages that are now returning a 404 error will continue to be listed for one year, and then dropped from the index.

That is what Supplemental Results are. They are older content, removed content and duplicate content.

There is no way to control this action. You can safely ignore them.

What you should be measuring is how many pages of the current site are fully indexed. THAT is what is important.

Check out all that I wrote in this other thread: [webmasterworld.com...] and the threads that it mentions.

bouncybunny

12:55 am on Sep 13, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks for that link. I read much of it, but it seems to deal predominantly with duplicate content.

Over 95% of the content of the site has been added since I converted the forum to the new software. Some of the threads were converted from the old forum, but less than 5% remain.

It's the 'fully indexed' part which I am concerned about. Most of the new pages/threads are being added to Google's index, but they are all going into the supplemental index along with the 404 pages.

When I do a search for content from those pages enclosed "in quotation marks", the pages are not returned in the SERPs, although the pages do show up in Google's cache when I look at them after a site: command.

As I said, about once a month around 30 pages pop into the main Google index, but they only remain there for a day or so before becoming supplemental again.

tedster

1:08 am on Sep 13, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The reason to understand duplicate content issues is that vBulletin, in its default configuration, has a design flaw that allows every thread and every post to have at least 10 different URLs that will access the exact same content.

That is what can generate a duplicate content problem, and very likely is what is sending your URLs into the Supplemental Index. At the very least, it can be a part of the problem.

Here's one thread where g1smd really tears into the issue:
[webmasterworld.com...]

jk3210

1:20 am on Sep 13, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



<<Supplemental Results for pages that are now returning a 404 error will continue to be listed for one year, and then dropped from the index>>

g1- I assume that also applies to 410 GONE?

pageoneresults

1:51 am on Sep 13, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I assume that also applies to 410 GONE?

It shouldn't. A 410 means the resource is GONE, kaput, patooey, ain't there no more. Once the bot receives that response, they should not return to reindex that URI.

A 404 means that the resource was not found. Most spiders will continue to recrawl a URL that returns a 404 numerous times and, at some point, will remove it from their index. In Google's case, that takes quite a bit of time (too much time, from my perspective).

Pages removed should return a 410. Pages moved should return a 301 (or 302 in appropriate instances).
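To make that distinction concrete, here's a self-contained sketch (not from the thread - the paths and the moved/removed lists are invented for illustration) of a server answering the way described above: 410 for deliberately removed pages, 301 for moved pages, and 404 for anything unknown.

```python
# Sketch: serve 410 for removed pages, 301 for moved pages, 404 otherwise.
# The paths below are placeholders, not real URLs from this site.
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

REMOVED = {"/old-thread-123.html"}          # deliberately deleted: 410 Gone
MOVED = {"/old-home.html": "/index.html"}   # relocated: 301 Moved Permanently

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path in REMOVED:
            self.send_response(410)         # Gone: spiders should stop asking
            self.end_headers()
        elif self.path in MOVED:
            self.send_response(301)         # point the bot at the new URL
            self.send_header("Location", MOVED[self.path])
            self.end_headers()
        else:
            self.send_response(404)         # Not Found: may be retried later
            self.end_headers()

    def log_message(self, *args):           # silence per-request logging
        pass

def fetch_status(url):
    """Return the HTTP status code for a URL, without following redirects."""
    class NoRedirect(urllib.request.HTTPRedirectHandler):
        def redirect_request(self, *args, **kwargs):
            return None                     # don't follow; surface the 301
    opener = urllib.request.build_opener(NoRedirect)
    try:
        return opener.open(url).status
    except urllib.error.HTTPError as e:
        return e.code                       # 3xx/4xx arrive as HTTPError

server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = "http://127.0.0.1:%d" % server.server_address[1]

print(fetch_status(base + "/old-thread-123.html"))  # 410
print(fetch_status(base + "/old-home.html"))        # 301
print(fetch_status(base + "/anything-else.html"))   # 404
server.shutdown()
```

The same three-way split applies whatever the server software is; on Apache in 2006 you'd express it with `Redirect gone`, `Redirect permanent`, and the default 404 handling.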

jk3210

2:51 am on Sep 13, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



<<Once the bot receives that response, they should not return to reindex that URI>>

That's what I thought too, but I have numerous URLs still showing up in Sitemaps after more than a year, and they are all returning 410 GONE, and even showing that way in Sitemaps.

They have been spidered numerous times throughout the last year, and many were spidered as recently as yesterday.

Very strange.

pageoneresults

3:21 am on Sep 13, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Very strange.

I may be wrong on this, and I'm off to find out this moment, but I believe Matt Cutts stated at one time not too long ago that they handled 404 and 410 the same way. Hold on...

Yup, here's a related discussion...

Returning 410 instead of 404
[webmasterworld.com...]

Best practice is to return the proper response. jdMorgan gives an excellent summary in the above topic of how this should be handled.

g1smd

9:56 am on Sep 13, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Imagine you buy a domain, and put up a page called www.domain.com/this.product.html and after many months it isn't indexed.

You ask on here, and no-one can see any reason why it shouldn't be indexed. You mention that search engines never even ask for that page, even though it has dozens of inbound links.

Then Matt Cutts posts and says, "Oh yes, just over 20 years ago that page returned a '410 Gone' response, so we will never access that file ever again."

You would think that daft. For something that is "Gone" of course Google needs to check the status from time to time just in case it is ever not-gone.

There is a world of difference between what Google checks the status of to see if it exists, and what they show in their index as actually existing.

In the same way, they do cache and keep a copy of pages that you tag as "meta robots noindex" but they do not show them in the public index.

They do that because they need to keep a record of the status of the page. If they didn't keep a record of a page being "noindex", they would come back every day to check its status, as if they had never accessed the page before. You would call that "broken".
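For reference, the tag being described here is just one line in the page's <head>. A minimal example (the "noindex,follow" variant keeps the page out of the public index while still letting its links be followed):

```html
<head>
  <!-- keep this page out of the public index, but still follow its links -->
  <meta name="robots" content="noindex,follow">
</head>
```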

jk3210

12:10 pm on Sep 13, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



[Clarification]

The reason I asked about 410 GONE...

Last year I removed 4,500 pages using the removal console and set-up a 410 GONE response for those pages.

After 14 months, 38 of the pages are still listed in Sitemaps as "Errors."

I always go on the assumption that Google has a reason for what they do (no cat-calls, please), so I'm just curious why the remaining 38 haven't been dropped in case there's something different about them that I don't realize.

The pages are definitely gone and as far as I can tell, there are no external links pointing to them.

It's not a big deal, but I was just curious.

g1smd

12:42 pm on Sep 13, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



That's correct then.

The 410 is an error code, and Google reports that the error code is seen and understood.

Presumably there are still some live links that point to just those, and they remain to be checked.

bouncybunny

12:46 pm on Sep 13, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The reason to understand duplicate content issues is that vBulletin, in its default configuration, has a design flaw that allows every thread and every post to have at least 10 different URLs that will access the exact same content.

Sure, most forums are a bit like that. And I've taken the usual robots.txt steps to deal with the worst offenders.
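For illustration, those usual steps look something like the sketch below. The script names are the duplicate-content offenders in a stock vBulletin 3.x install - any given forum's list may differ.

```
User-agent: *
Disallow: /printthread.php
Disallow: /showpost.php
Disallow: /sendmessage.php
Disallow: /search.php
Disallow: /memberlist.php
Disallow: /calendar.php
```

Note that robots.txt can only block crawling of whole scripts; it can't collapse the several parameter variations of showthread.php into one canonical URL - that needs a 301 redirect.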

But even where duplicate content is present, Google is fairly good about choosing one source of URLs to index, whilst assigning the duplicates to the supplemental index. And of course there are tens of thousands of extensively indexed vBulletin forums around the world.

The problem here is that not a single thread from the forum - not even the main boards - is in the main Google index.