My question is, should I remove the pages and just issue a standard 404 not found response, or should I issue a 301 redirect back to my www.example.com page to capture any potential external links that might be going to these pages?
I think the "right" thing to do is issue the 404 and just ignore any links that are going here, but the SEO side of me says I should issue the 301 redirect from /products/widget.html to www.example.com/ to funnel any potential link weight those old pages have from external links to the main index.
It sounds like doing a 404 is the right thing, and just throwing away any potential external links that might be pointing to these pages, since links to the main index are worth more than links to the individual products?
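In Apache terms, the two options I'm weighing would look something like this (just a rough sketch, assuming mod_alias is available; the error page name is made up):

  # Option A: delete the files and let Apache return its default 404 Not Found,
  # optionally pointing visitors at a friendlier error page
  ErrorDocument 404 /not-found.html

  # Option B: 301 the old product page to the home page
  Redirect 301 /products/widget.html http://www.example.com/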
They are accessible now, but really buried, and it takes work to find them.
[mattcutts.com...]
Here is the important info from it:
"The most useful tidbit ... is that Google treats a 404 HTTP status code ... and a 410 HTTP status code ... in the same way ... once Googlebot has seen a 404 at that location, I think we assume that the document is gone forever. Given how many people use 404 instead of 410, that’s probably a good call for the time being."
Thanks for the replies BTW.
But I wouldn't sweat it; the 'risks' associated with just binning the pages are small - it's all a matter of doing what is most helpful to the occasional visitor who may otherwise be confused.
Even with a (user friendly) 404, the pages will simply drop out of the index eventually, with no real harm to man or beast.
Another issue is the sloppy way MSN Search handles 301s (which can't be ignored) - it keeps a page saying "301 Moved" in the cache, which usually shows up at the very beginning of a site: search. There's more than just Google in life for sites, and I personally see no reason not to do exactly what the Apache documentation indicates is the correct handling for removed pages, when it's easy enough to do.
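For what it's worth, sending the 410 Gone status that marks a deliberately removed page is easy with mod_alias (a minimal sketch; the path is hypothetical):

  # Tell clients and bots the page was removed on purpose (410 Gone)
  Redirect gone /old-products/widget.html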
Ok, though the point, as far as Google is concerned, is to tell Googlebot a page no longer exists. Googlebot doesn't need to see a 404 or a 410 if the url is disallowed in robots.txt.
According to Vanessa Fox,
Content Removal:
Individual URLs
Choose this option if you'd like to remove a URL or image. In order for the URL to be eligible for removal, one of the following must be true:
* The URL must return a status code of either 404 or 410.
* The URL must be blocked by the site's robots.txt file.
* The URL must be blocked by a robots meta tag.
That way, the page's content will no longer be available for fetching or scraping by rogue bots and site scrapers, who could possibly find the URL through some unknown or forgotten link someplace - or else, it'll no longer be a valid dup if there are any unauthorized copies floating around someplace.
Done the right way very easily, with all bases covered.
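For illustration, the second and third options on that list might look like this (just a sketch; the path is hypothetical). In robots.txt:

  User-agent: *
  Disallow: /products/widget.html

And in the <head> of the page itself:

  <meta name="robots" content="noindex">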
Googlebot doesn't need to see a 404 or a 410 if the url is disallowed in robots.txt.
I am fairly sure that what Vanessa Fox was talking about there was simply the criteria for URLs that you submit to their fast-track URL Removal Tool.
For general web spidering, a URL blocked by robots.txt can hang around in the SERPs for years as a URL-only entry - especially if something somewhere still links to it.
Use the HTTP status codes to let everyone see the real status of the page. Don't hide it behind a spidering block.
She is also unlikely to have had sites of her own that were unmercifully ravished by hijackers, scrapers, and outright content thieves who swipe entire pages and put their AdSense on them.
I have, and now I'll either have to redo and rewrite a couple of sites entirely, or ditch them altogether and let them sit parked someplace. Or maybe sell them; domains a few years old should be able to get me ten bucks or so, which is worth more than those sites are now.
No, she was talking about getting rid of URLs from the SERPs. Look up "vanessa fox 404" for details.
"For general web spidering, a URL blocked by robots.txt can hang around in the SERPs for years as a URL-only entry - especially if something somewhere still links to it."
Obviously.
"Don't hide it behind a spidering block."
No one suggested relying solely on a disallow directive.
Make it easier on Googlebot and install a robots disallow alongside the 404. Checking a URL path against a disallow directive is easier than checking response headers.
"Is also unlikely to have had sites of her own that were unmercifully ravished by hijackers, scrapers and outright content thieves who swipe entire pages and put their Adsense on them."
How is that related to 404s or robots disallow? Attacking the credibility or authority of the messenger doesn't invalidate the message.
Again: if you have a robots.txt exclusion for a URL, then the bot never gets to "see" the 404. The robots.txt exclusion hides the real status of the page from the bot. That is not a good idea.
Let the bot see the 404. That 404 reply to the bot really does say that there is nothing there.
The robots.txt exclusion is especially weak. It merely says "do not access this URL". It says nothing about whether a page exists at that URL or not, nor should it.
The robots.txt exclusion still allows that "excluded" URL to show in the SERPs as a URL-only entry. If it continues to show, then it hasn't been removed.
A "404" HTTP status code returned for a URL categorically states that there is nothing there, and since there is nothing there, the URL gets completely removed far more quickly.
Don't hide the 404 status by using robots.txt as well. Use only the 404 page.
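If you want to confirm what a bot actually sees for a removed URL, check the response headers yourself (assuming curl is handy; the URL is just an example):

  curl -I http://www.example.com/products/widget.html
  # HTTP/1.1 404 Not Found  <- what the bot should see for a deleted page
  # (or "HTTP/1.1 410 Gone" if the server is configured to send that instead)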
"Is also unlikely to have had sites of her own that were unmercifully ravished by hijackers, scrapers and outright content thieves who swipe entire pages and put their Adsense on them."How is that related to 404s or robots disallow?
Halfdeck, it's already been explained in this thread how it's related. And no, the messenger likely hasn't had to deal with certain other issues and has no cause or obligation to address them. The messenger was addressing an issue from a certain perspective, but there are other considerations related to that issue which really concern no one but the webmaster who's being affected.
If a webmaster wants a page of theirs OUT of the index and wants it totally GONE and unavailable in any way, then 410 is the answer. g1 is 100% correct that robots.txt exclusion will hinder that - for well behaved bots. But if the page hasn't been physically removed, it'll still be accessible to those who don't honor robots.txt - or don't honor anything else, for that matter.
robots.txt exclusion is for excluding pages that are there from being spidered.
Halfdeck, it's already been explained in this thread how it's related.
My bad, I missed your previous message. It wasn't me; it was the beer.
A little over a week ago, I used both methods for two different sections of a site:
1) Physically remove a set of URLs and issue 404s. These are top level pages, frequently crawled. No internal links point to these pages now.
2) robots.txt disallow another set of URLs. These are also top level pages, about equal in PageRank to 404ed URLs, probably with similar crawl frequency. Some internal links still point to this set of URLs.
Now the 404 pages, all 16 of them, are gone from the index. The robots.txt disallowed pages are mostly URL-only.
I have a bunch of pages to remove (some still ranked, others not) as I want to reorganize my files. After 5 years, things get really messy.
I understand that I can just submit to Google the pages that are no longer valid. I don't know the link, though I have seen it before. Honestly, I don't know much about 410s or 404s. Can I just submit the pages that I want removed and be done with it? Will I be penalized in any way, or be at risk in my overall rankings (the rankings that matter - not from the pages I want to remove)? I really need help to do this right. As I am SEO challenged, I'd appreciate a step-by-step guide. Please help.
If the content has moved to a new URL, then set up a 301 redirect from the old URL to the new URL so that, wherever the old URL is still publicised, visitors to your site will still be able to get to that content.
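On Apache, for example, that can be as simple as this (a sketch, assuming mod_alias; the file names are placeholders):

  # Permanently redirect the old URL to its new location
  Redirect 301 /old-name.html http://www.example.com/new-name.html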
If the pages still exist, but you do not want them to be indexed, then add the <meta name="robots" content="noindex"> tag to each of the pages that you do not want to show up in the SERPs.
1) Pages to be removed completely. The pages are currently still on my site.
...so I should have the server send a 404 or 410 response, right? Which is better?
2) Pages that I want to rename, edit a little, and yet have the PR and traffic from the old page transferred to the new page name.
...use a 301 redirect, right? Should I remove the old file from the server immediately, or wait a little while? Also, will the PageRank and traffic picked up from the search engines eventually be transferred to the new page, or will it have to work its way up again?
3) Pages that I want to retire and remove, yet want all traffic transferred to some remaining old pages.
...what should I do for these ones?
Thanks a lot.
3) Isn't 100% clear. Are there inbound links to those pages?
2) I figure these pages that I want to reorganize are not very heavily visited, so I guess a 301 will be best, as suggested. OK, so I should remove them so they don't hang around.
3) I don't think there are inbound links, but I'd still like to transfer the traffic and PR these pages have to some existing pages.
A 404 response is just the default server behaviour if a file cannot be found which corresponds to the requested URL.
The problem with a 404 is that it could be your fault (Webmaster linking or scripting error), the server's fault (hardware/software error), or the client's fault (user/spider requesting invalid/corrupt URLs). That's why Google et al repeatedly request URLs that 404: They cannot tell if the 404 error condition is temporary or permanent because they do not know the cause. They can choose to re-request the 404'ed URL repeatedly forever, immediately dump it from their index, or choose some arbitrary time period or retry count after which they dump the URL.
But in two out of those three cases, you can suffer if they dump the URL before you can fix the problem; that page won't rank again for quite some time. Or if they won't dump it when you want them to, then you lose the bandwidth they consume re-requesting that obsolete URL again and again (I used to call this the "Inktomi syndrome").
Using a 410 to specifically say "This page has been intentionally removed" may be advantageous in the future, because SEs that support it properly can eliminate the unnecessary attempts to re-fetch URLs that don't exist any more.
Jim
What do you mean by "completely removed"?
* Do you want to remove the pages from your site AND from the SERPs?
If yes, then delete the file from the server and let the URL serve a 404 response - or delete the files and configure the server to return a 410 response.
* OR, do you just want to remove the pages from the SERPs but allow them to still exist on your site?
I get the feeling you want this latter option. If that is the case, then you need a <meta name="robots" content="noindex"> tag on each of those pages.