Forum Moderators: Robert Charlton & goodroi
What do you guys think?
- 200 - page is here, index this URL and index its content (unless robots blocked).
- 301 - page is over there, go fetch that URL and look at the content over there; forget this URL.
- 302 - page is over there, go index the content over there, but index it under this URL here.
- 404 - there is no page here; nothing to see; move along.
What you show your visitor for a 404 page is up to you, but it isn't a true 404 page unless the server returns the 404 status code in the HTTP header.
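To make the "true 404" point concrete, here's a minimal sketch (a toy server, not anyone's actual setup -- the page content and port handling are invented for illustration). The status code lives in the HTTP header; the body can be as friendly as you like:

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

CUSTOM_404_BODY = b"<html><body><h1>Sorry, that page isn't here.</h1></body></html>"

class Custom404Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Every path is "missing" in this toy server; a real server
        # would check the filesystem first.
        self.send_response(404)  # the true 404 status line
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(CUSTOM_404_BODY)))
        self.end_headers()
        self.wfile.write(CUSTOM_404_BODY)  # the friendly custom body

    def log_message(self, *args):  # silence request logging for the demo
        pass

server = HTTPServer(("127.0.0.1", 0), Custom404Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
conn.request("GET", "/whatever.html")
resp = conn.getresponse()
body = resp.read()
server.shutdown()
print(resp.status)
```

The visitor sees the custom page, but a header check (or a crawler) sees 404 -- which is the combination being recommended in this thread.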
Mine returns a 200 because I'm slowly adding redirects as I notice the "page not found" page is accessed. I'd like to wait a while longer to turn on the 404s, but if it's really important for Google I could do it right away.
I just typed in
www.example.com/whatever.html
for a site of mine, which doesn't exist and never has. That URL stays in the browser address bar, but checking the server header shows it returns a 404 while delivering the content of my custom error page, which is located at
www.example.com/errorpage.html
Running errorpage.html through a header check, it also returns a 404. Nothing was ever done, except putting a line in .htaccess
ErrorDocument 404 /errorpage.html
If a page isn't there, it should automatically return a 404, unless it's been designated (in .htaccess) to be 410 - permanently gone - removed forever - it's history.
This is what that site's .htaccess looks like for missing pages, including a subdirectory that I removed permanently:
Redirect gone /removedfolder/
ErrorDocument 410 /errorpage.html
ErrorDocument 404 /errorpage.html
What's returned is whatever those missing pages are supposed to return, either 404 or 410, with nothing more done or needed. The 410 was the exception: it was deliberate and needed an entry saying so, so it wouldn't return a 404. The others don't need anything special.
[edited by: Marcia at 9:34 pm (utc) on July 14, 2007]
Here's one risk with a 200 status on a custom error page. That page is commonly served as the result of a 302 redirect returned by the originally requested url. Every "bad" url then triggers the same error page, and all those bad urls get indexed as duplicate content -- whatever content is on that error message page.
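For what it's worth, one common way this happens on Apache is pointing ErrorDocument at a full URL instead of a local path. Apache then answers the bad request with a redirect to the error page (which itself returns 200) rather than serving it internally with the 404 status. A hedged illustration, with a made-up domain:

# Anti-pattern: a full URL makes Apache issue a redirect, so the bad
# url returns a redirect status and the error page returns 200
ErrorDocument 404 http://www.example.com/errorpage.html

# Safe form: a local path is served internally with the true 404 status
ErrorDocument 404 /errorpage.html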
After a period of time, a large number of urls build up in Google's index that point to the exact same content. I've seen an entire domain slapped with a penalty or removed from the index completely.
Ted, that sounds like the best description I've heard to date of what may have happened to one of my sites. Due to an error on my part, there was a period of time that a 200 status was returned instead of a 404 for all pages of the site when I took it out of a CMS to static pages which required complete URL changes. It was a risk I had to take but I didn't think it would suffer for long. It's been effectively slapped since February (or earlier) although the proper status is being returned now. Any idea of how long it takes to recover?
It's been effectively slapped since February (or earlier) although the proper status is being returned now. Any idea of how long it takes to recover?
I haven't recently been close to any site that recovered from this kind of error page trouble. One I advised last year took three months to show improvement, and that was just the beginning of a gradual recovery cycle.
If bogus urls have supplemental status (pretty likely), then recovery depends on those urls being recrawled -- at the pace of the supplemental index crawl, not the regular index crawl. There's good news here: supplemental urls are now being crawled more frequently than they were last year.
There are some very good threads about this in our Hot Topics [webmasterworld.com] section, which is always pinned to the top of this forum's index page.
For example:
Duplicate Content [webmasterworld.com] - get it right or perish
Duplicate Content [webmasterworld.com] - comments from Google's Adam Lasnik
Thin Affiliate Pages [webmasterworld.com] - with comments from Google's Adam Lasnik
Vbulletin [webmasterworld.com] & Wordpress [webmasterworld.com] - avoiding duplicate content
It didn't involve 404s and 302s, but I've been amazed at how fast the crawlers picked up the right URLs and indexed them all correctly. Corrected URLs started to appear in the index within hours of implementing an .htaccess fix. And to our amazement all is now right, and the rankings for the homepage were back from "nowhere to be found" in less than a week's time.
It may not happen with all cases, but they really are getting better and better and the crawl folks deserve a lot of credit for a job well done.
So make sure you have no other multiple urls issues, such as the no-www and with-www varieties, the index.html and directory root type, other unwise use of 302 redirects, and so on.
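For the no-www vs. with-www variety, the usual fix on Apache is a mod_rewrite rule along these lines (a sketch only -- it assumes mod_rewrite is enabled and uses example.com as a stand-in for your domain):

# Force the with-www host so the no-www and with-www versions
# don't get indexed as duplicate content
RewriteEngine On
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]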
I just fixed the "index.html and directory root type" issue thanks to one of the hot topics threads. The no-www vs. with-www variety had been fixed months ago and to the best of my knowledge I've got no 302 redirects to deal with.
Anything else I can kick?
Are you using any url rewriting? That can sometimes introduce trouble when the rewriting keys only off a number in the url: the keyword part of the url can still be a typo and get a 200.
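A hypothetical example of that pitfall, in mod_rewrite terms (the rule and filenames are invented for illustration):

# The rule keys only on the trailing numeric id, so both of these
# resolve to the same script and return 200:
#   /real-product-title-123.html
#   /any-typo-at-all-123.html
RewriteRule ^[^/]+-([0-9]+)\.html$ /show.php?id=$1 [L]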
Only for some very old urls that still appear in the list of urls getting 404s. There are no numbers in them and they date back before the site was put into a CMS. Essentially I've taken the site and its structure back a few years when I took it out of the CMS. I didn't manage to get the wording of all of the urls exactly as they had been and those are the few that have now been 301'd to their new url.
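For reference, redirects like that can be done with one .htaccess line per renamed page, in the same style as the Redirect gone line shown earlier (paths invented for illustration):

# Permanent redirect from an old CMS-era url to its new static page
Redirect 301 /old-cms-article /new-article.html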
Also, I'd validate your robots.txt and make sure it's doing exactly what you intend. If you don't already have a Webmaster Tools account, I'd suggest one just for the extra reporting you get back from Google inside.
Robots.txt is as it should be. I have a Webmaster Tools account and have caught quite a few things already from their reporting. However, I am disappointed that much of the information in there is rarely brought up to date. Other than the crawl information it has been weeks since anything changed. The "Link" section is DOA and hasn't changed significantly since they added it. It is completely out of whack. The removal tool did work wonders for the 404 urls that belonged to the CMS version of the site. They're gone.