|WMT Warning about soft 404 errors|
| 9:58 pm on Aug 21, 2012 (gmt 0)|
Just had an email from Webmaster Tools warning me about "transient soft 404 errors" on my site. The email says the errors may indicate an "outage", but I don't think that's the problem.
The email gave some examples of the pages that are causing the trouble. They are all on the calendar part of my new forum. Originally I had the calendar open to guests as well as registered users, but then I shut it off to guests - and, of course, that means I shut it off to Googlebot too.
I'm using very well-known forum software. The problem pages all return 200 codes (I've checked) and they all now show the same content: a message saying "You're not logged in, please log in to see this page" etc. The pages don't redirect anywhere, so they all have URLs like mysite.com/forum/calendar.php?787878 with differing strings of numbers.
So... I'm confused as to what is happening here and what Google would like me to do. Why are these pages giving a soft 404 if they return a 200 response and have text on them? Should they be returning a 401 instead?
| 11:28 pm on Aug 21, 2012 (gmt 0)|
Yes. Those are classic "soft 404" pages.
There's no content that is of any use, but the page returns 200 OK.
Program the script to return some other status code, perhaps 401 if you need to log in to see the content.
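For example, in the forum's PHP - a minimal sketch, where is_logged_in() is a placeholder for whatever session check your forum software actually provides:

    <?php
    // Send a real error status instead of 200 OK when a guest hits
    // a members-only page. is_logged_in() is hypothetical -- swap in
    // your forum's own session/permission check.
    if (!is_logged_in()) {
        header('HTTP/1.1 401 Unauthorized');
        // A strict 401 also expects a WWW-Authenticate header; with
        // cookie-based logins a 403 Forbidden is a common alternative:
        // header('HTTP/1.1 403 Forbidden');
        echo 'You must log in to view this page.';
        exit;
    }

Either way, Google sees an error status rather than a 200 OK with boilerplate text, and stops flagging the URLs as soft 404s.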
| 11:29 pm on Aug 21, 2012 (gmt 0)|
Sounds like you are actually describing a "soft 404" here - that is, many different URLs that all return a 200 OK HTTP status but all show the same content.
I think a 403 Forbidden would be more appropriate. A 200 OK indicates that the actual content was returned, and in this case it isn't - the content you're sending is a kind of error message.
| 11:40 pm on Aug 21, 2012 (gmt 0)|
|perhaps 401 if you need to log in to see the content |
Right - that would be better than a 403.
| 1:09 am on Aug 22, 2012 (gmt 0)|
I tried something a little different with my well-known forum software. I noticed that a lot of duplicate pages were being returned when I searched for exact phrases - the print version, archive version, SEO-friendly URL version, etc. - so I put an Apache directive in the forum directory that returns an X-Robots-Tag header of "noindex, nofollow" on every forum page EXCEPT the forumlist, threadlist, and thread pages.
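Something like this in the forum directory's .htaccess (a rough sketch - the script names index, forumdisplay and showthread are assumptions, so substitute whatever your forum software actually uses):

    # Flag the page types that should stay indexed; every other page
    # served from this directory gets the noindex, nofollow header.
    # The script names below are assumptions -- use your forum's own.
    SetEnvIf Request_URI "/(index|forumdisplay|showthread)\.php" KEEP_INDEXED
    Header set X-Robots-Tag "noindex, nofollow" env=!KEEP_INDEXED

Needs mod_setenvif and mod_headers enabled, but it keeps the noindex logic out of the forum code itself, so it survives software upgrades.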
It takes Google forever and a day to actually remove pages from its index, so only time will tell whether this was a good strategy. A site: search restricted to the forum subdirectory returned something like 100,000 pages, but there are just 15,000 forum threads - meaning I had somewhere near 85,000 pages of useless crap indexed.