Forum Moderators: Robert Charlton & goodroi
What do you guys think?
- 200 - page is here, index this URL and index its content (unless robots blocked).
- 301 - page is over there, go fetch that URL and look at the content over there; forget this URL.
- 302 - page is over there, go index the content over there, but index it under this URL here.
- 404 - there is no page here; nothing to see; move along.
What you show your visitor for a 404 page is up to you, but it isn't a true 404 page unless the server returns the 404 status code in the HTTP header.
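To make the "true 404" point concrete, here's a minimal sketch (a toy server, not anyone's actual setup -- the page content and port handling are invented for illustration). The status code lives in the HTTP header; the body can be as friendly as you like:

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

CUSTOM_404_BODY = b"<html><body><h1>Sorry, that page isn't here.</h1></body></html>"

class Custom404Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Every path is "missing" in this toy server; a real server
        # would check the filesystem first.
        self.send_response(404)  # the true 404 status line
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(CUSTOM_404_BODY)))
        self.end_headers()
        self.wfile.write(CUSTOM_404_BODY)  # the friendly custom body

    def log_message(self, *args):  # silence request logging for the demo
        pass

server = HTTPServer(("127.0.0.1", 0), Custom404Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
conn.request("GET", "/whatever.html")
resp = conn.getresponse()
body = resp.read()
server.shutdown()
print(resp.status)
```

The visitor sees the custom page, but a header check (or a crawler) sees 404 -- which is the combination being recommended in this thread.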
Mine returns a 200 because I'm slowly adding redirects as I notice the "page not found" page is accessed. I'd like to wait a while longer to turn on the 404s, but if it's really important for Google I could do it right away.
I just typed in
www.example.com/whatever.html
for a site of mine, which doesn't exist and never has. That URL stays in the browser address bar, but checking the server header shows it returns a 404 while delivering the content of my custom error page, which is located at
www.example.com/errorpage.html
Running errorpage.html through a header check, it also returns a 404. Nothing was ever done, except putting a line in .htaccess
ErrorDocument 404 /errorpage.html
If a page isn't there, it should automatically return a 404, unless it's been designated (in .htaccess) to be 410 - permanently gone - removed forever - it's history.
This is what that site's .htaccess looks like for missing pages, including a subdirectory that I removed permanently:
Redirect gone /removedfolder/
ErrorDocument 410 /errorpage.html
ErrorDocument 404 /errorpage.html
What's returned is whatever those missing pages are supposed to return, either 404 or 410, with nothing more done or needed. The 410 was the exception: it was deliberate and needed an entry saying so, so it wouldn't return a 404. The others don't need anything special.
[edited by: Marcia at 9:34 pm (utc) on July 14, 2007]
Here's one risk with a 200 status on a custom error page. That page is commonly served as the result of a 302 redirect returned by the originally requested url. Every "bad" url then triggers the same error page, and all those bad urls get indexed as duplicate content -- whatever content is on that error message page.
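For what it's worth, one common way this happens on Apache is pointing ErrorDocument at a full URL instead of a local path. Apache then answers the bad request with a redirect to the error page (which itself returns 200) rather than serving it internally with the 404 status. A hedged illustration, with a made-up domain:

# Anti-pattern: a full URL makes Apache issue a redirect, so the bad
# url returns a redirect status and the error page returns 200
ErrorDocument 404 http://www.example.com/errorpage.html

# Safe form: a local path is served internally with the true 404 status
ErrorDocument 404 /errorpage.html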
After a period of time, a large number of urls build up in Google's index that point to the exact same content. I've seen an entire domain slapped with a penalty or removed from the index completely.
Ted, that sounds like the best description I've heard to date of what may have happened to one of my sites. Due to an error on my part, there was a period of time that a 200 status was returned instead of a 404 for all pages of the site when I took it out of a CMS to static pages which required complete URL changes. It was a risk I had to take but I didn't think it would suffer for long. It's been effectively slapped since February (or earlier) although the proper status is being returned now. Any idea of how long it takes to recover?
It's been effectively slapped since February (or earlier) although the proper status is being returned now. Any idea of how long it takes to recover?
I haven't recently been close to any site that recovered from this kind of error page trouble. One I advised last year took three months to show improvement, and that was just the beginning of a gradual recovery cycle.
If bogus urls have supplemental status (pretty likely), then recovery depends on those urls being recrawled -- at the pace of the supplemental index crawl, not the regular index crawl. There's good news here: supplemental urls are now being crawled more frequently than they were last year.
There are some very good threads about this in our Hot Topics [webmasterworld.com] section, which is always pinned to the top of this forum's index page.
For example:
Duplicate Content [webmasterworld.com] - get it right or perish
Duplicate Content [webmasterworld.com] - comments from Google's Adam Lasnik
Thin Affiliate Pages [webmasterworld.com] - with comments from Google's Adam Lasnik
Vbulletin [webmasterworld.com] & Wordpress [webmasterworld.com] - avoiding duplicate content
It didn't involve 404s and 302s, but I've been amazed at how fast the crawlers picked up the right URLs and indexed them all correctly. Corrected URLs started to appear in the index within hours of implementing an .htaccess fix. And to our amazement all is now right, and the rankings for the homepage were back from "nowhere to be found" in less than a week's time.
It may not happen with all cases, but they really are getting better and better and the crawl folks deserve a lot of credit for a job well done.
So make sure you have no other multiple urls issues, such as the no-www and with-www varieties, the index.html and directory root type, other unwise use of 302 redirects, and so on.
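For the no-www vs. with-www variety, the usual fix on Apache is a mod_rewrite rule along these lines (a sketch only -- it assumes mod_rewrite is enabled and uses example.com as a stand-in for your domain):

# Force the with-www host so the no-www and with-www versions
# don't get indexed as duplicate content
RewriteEngine On
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]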
I just fixed the "index.html and directory root type" issue thanks to one of the hot topics threads. The no-www vs. with-www variety had been fixed months ago and to the best of my knowledge I've got no 302 redirects to deal with.
Anything else I can kick?
Are you using any url rewriting? That can sometimes introduce trouble when the rewriting keys only off a number in the url: the keyword part of the url can still be a typo and get a 200.
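A hypothetical example of that pitfall, in mod_rewrite terms (the rule and filenames are invented for illustration):

# The rule keys only on the trailing numeric id, so both of these
# resolve to the same script and return 200:
#   /real-product-title-123.html
#   /any-typo-at-all-123.html
RewriteRule ^[^/]+-([0-9]+)\.html$ /show.php?id=$1 [L]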
Only for some very old urls that still appear in the list of urls getting 404s. There are no numbers in them and they date back before the site was put into a CMS. Essentially I've taken the site and its structure back a few years when I took it out of the CMS. I didn't manage to get the wording of all of the urls exactly as they had been and those are the few that have now been 301'd to their new url.
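For reference, redirects like that can be done with one .htaccess line per renamed page, in the same style as the Redirect gone line shown earlier (paths invented for illustration):

# Permanent redirect from an old CMS-era url to its new static page
Redirect 301 /old-cms-article /new-article.html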
Also, I'd validate your robots.txt and make sure it's doing exactly what you intend. If you don't already have a Webmaster Tools account, I'd suggest one just for the extra reporting you get back from Google inside.
Robots.txt is as it should be. I have a Webmaster Tools account and have caught quite a few things already from their reporting. However, I am disappointed that much of the information in there is rarely brought up to date. Other than the crawl information it has been weeks since anything changed. The "Link" section is DOA and hasn't changed significantly since they added it. It is completely out of whack. The removal tool did work wonders for the 404 urls that belonged to the CMS version of the site. They're gone.