|SEO help with missing 404 server responses|
| 10:21 am on May 2, 2013 (gmt 0)|
I know this is a deep topic, but I would much appreciate your brief help about this.
What will happen if a CMS and the server are not customized and set up to return a 404 server header response when a non-existent page is requested? I know this is easy to set at the server level, but what will happen if someone forgets to do so?
I see that almost all the non-existent pages on my website return a 200 response when they are supposed to return a 404.
Thanks for all your help.
| 12:27 pm on May 2, 2013 (gmt 0)|
You'll get a message in your GWT about "soft" 404s, for one thing.
| 12:43 pm on May 2, 2013 (gmt 0)|
Thanks Netmeg. So are you sure that GWT soft 404 errors are based on such things?
Will it have a negative impact on the site? I mean, when a page is deleted it will still return a 200, and Google will continue to crawl that empty error page and index it, right?
| 4:39 pm on May 2, 2013 (gmt 0)|
Eventually if you get enough of them, it probably will have some negative effect. For one thing, it may well goof up your crawl budget, and you'll find Google coming back to pick up your real pages less often.
If you remove a page, you should serve a real 404 or a 410. If your CMS can't handle it, then either fix it or change to another CMS. Which of course is a PITA but them's the choices if you care at all about search traffic.
| 8:31 pm on May 2, 2013 (gmt 0)|
|I see that almost all the non-existent pages in my website returns 200 response when they are supposed to return 404 |
shaunm - As I remember, on one question you posted a while back, it turned out that you were running a CMS on IIS. It's unfortunately fairly standard procedure in IIS to set up custom error pages improperly, where the error is 302 redirected to a custom error page that returns a 200 OK.
With this kind of arrangement, certain types of errors can theoretically return an infinite number of 200 responses for non-existent urls. Google never sees the proper 404 response. The problem is so common that Google has figured out a way to detect it, at least to the degree of returning "soft 404s" . IMO, it's a problem that should be fixed in your server setup.
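One way to see what a URL really returns, without a redirect-following tool hiding the 302 → 200 chain, is to make a single raw request and read only the first status line. A minimal sketch in Python (the host and path here are made-up stand-ins; run it against your own domain):

```python
import http.client

def raw_status(host, path):
    """Return the status code the server itself sends for one request,
    without following any redirect -- so a 302 to a custom error page
    shows up as 302, not as the 200 the error page finally returns."""
    conn = http.client.HTTPConnection(host, timeout=10)
    conn.request("HEAD", path)
    status = conn.getresponse().status
    conn.close()
    return status

# A deliberately garbled URL should come back 404.
# If you see 302 or 200 instead, the custom-error setup is masking it:
# print(raw_status("example.com", "/no-such-page-qxzv"))
```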
On some IIS/.NET CMS systems I've seen (shudder), the documentation and interface check boxes are so confusing, though, that it's truly not clear what's going on, or what header response to expect. (And a further confusion... some of these CMS systems also use dynamically generated canonical tags, thinking that this will address all problems.)
For more detail on at least the tip of the iceberg, take a look at this thread, which might get you started on fixing it...
Custom Error Pages - Beware the Server Header Status Code
| 10:29 am on May 3, 2013 (gmt 0)|
Thanks for answering my post, and for that useful link :-)
I know it's pretty easy to set up at the server level, but until then I just wanted to know what negative impact it will have on the site.
Do you think a site with fewer than 60 pages should worry about this kind of technical fault? And I like the way you remembered from my previous post that the site is on an IIS server :-)
| 11:17 am on May 3, 2013 (gmt 0)|
I've seen smaller sites get into trouble this way in the past - from two scenarios, one malicious and one innocent. If your site begins to attract attention (i.e. rank well competitively) and your competition notices the vulnerability, then an intentional, negative SEO attack can spring up. And otherwise, you can attract automatic linking from various widespread spammer scripts that grab outbound backlinks to try to improve the trust of the spam site.
In either case, I don't consider it worth the risk, given the simplicity of the fix. If there is some obstacle, at the very least use a canonical URL link on every page.
| 11:29 am on May 3, 2013 (gmt 0)|
Thank you Ted :-)
I am really confused. In what way can such an issue attract spammers? I mean, how can they actually benefit from it?
Say my site is abc.com and it should return a 404 response when a page like abc.com/zzzz is requested by a browser, but in reality it returns a blank page with a 200 server response. How can my competition or spammers make use of it?
| 7:28 pm on May 3, 2013 (gmt 0)|
|In what way such an issue can attract spammers? I mean how can they actually get benefit from it? |
It's not clear that they can, for real. However, with the release of the Google Webmaster video a few years ago - focusing on the value of credible outbound links - it became "a thing" for spammers to experiment with. Once it's in their formula, it's slow to leave because spamming is often a mass-volume, churn and burn business.
| 9:17 pm on May 3, 2013 (gmt 0)|
|in reality it returns a blank page with a 200 server response |
That's potentially several different things.
First: If you're using a CMS, your own server will always record a 200: the request has been successfully handed off to the php script or similar. What matters is the response the user receives.
Second: The blank page may mean that the CMS isn't doing its thing, so bad parameters lead to an empty page accompanied by a 200 header.
Third: In the slightly better alternative, the CMS meant to return a 404 response, but by the time it figured out that the parameters were bad, it had already output some content-- maybe a <head> section, maybe just a humble line break-- so it was no longer able to send the 404.
Fourth: The recipient is receiving a 404. But the CMS isn't supplementing this with the physically visible custom 404 page, so humans see an empty window.
Fifth: ... et cetera.
Your task is to
#1 make sure the CMS doesn't generate any html before it is certain that it's got valid parameters
#2 if parameters are bad, return a 404 response
#3 accompany the 404 response with display of your custom 404 page. The CMS has to do it; the server can't.
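The three steps above can be sketched in a few lines. This is shown here as a Python WSGI app rather than the PHP a typical CMS would use, and `PAGES` and the 404 markup are made-up stand-ins; the point is the ordering: validate first, pick the status, and only then emit any HTML.

```python
# Stand-in "CMS" content and custom 404 page (assumptions, not real paths).
PAGES = {"/": "<h1>Home</h1>", "/about": "<h1>About</h1>"}
NOT_FOUND_HTML = "<h1>Error - Page Not Found</h1>"

def app(environ, start_response):
    path = environ.get("PATH_INFO", "/")
    body = PAGES.get(path)
    if body is None:
        # Bad parameters: send a real 404 header AND the custom page.
        # No HTML has been emitted yet, so the status is still ours to set.
        start_response("404 Not Found", [("Content-Type", "text/html")])
        return [NOT_FOUND_HTML.encode()]
    start_response("200 OK", [("Content-Type", "text/html")])
    return [body.encode()]
```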
Once google discovers that requesting absolutely anything results in a 200, it will go berserk and ask for nonstop garbage. I can always tell when the googlebot is getting suspicious of some directory, because logs show a flurry of requests for /directory/some-random-garbage.html (Oddly, it has never yet become suspicious of directories where I really did forget to code for 404s. But luck does not last forever.)
| 7:52 am on May 6, 2013 (gmt 0)|
Thank you, again :-)
Thank you! I must admit it was informative for me. And I hope you'll answer my following questions as well :-)
|First: If you're using a CMS, your own server will always record a 200: |
So, this cannot happen with hosting the website in some other server right?
|Second: The blank page may mean that the CMS isn't doing its thing, so bad parameters lead to an empty page accompanied by a 200 header. |
I am sorry, I was wrong. It's neither a blank page nor a custom 404 error page, but a page with all the top menus and navigation which says 'Error - Page Not Found
You are seeing this error because you tried to access a web page that does not exist' - but the server response is still 200 when checked with seobook's server response tool or any other server response checker.
|Third: In the slightly better alternative, the CMS meant to return a 404 response, but by the time it figured out that the parameters were bad, it had already output some content-- maybe a <head> section |
Can this actually happen? I mean, it's not going to take a minute or an hour - everything is performed within 3 to 5 seconds, right? Can this occur within that time frame?
|maybe just a humble line break-- so it was no longer able to send the 404. |
What is a line break?
Final, and most eagerly asked, question:
|Once google discovers that requesting absolutely anything results in a 200 |
How does Google actually discover that requesting anything returns a 200 for a website? I mean, it only follows the internal links through its crawling process, right? And those pages that return a 200 aren't even being crawled or indexed, for that matter. So how can Google do it?
Thanks for all your help :-)
| 8:34 am on May 6, 2013 (gmt 0)|
|So, this cannot happen with hosting the website in some other server right? |
It doesn't matter whose server it is. I meant: the server that the site lives on. The key part is: It doesn't matter what response your server sends out. What matters is what response the user receives. If you've got pure static html, they will be the same. But they don't have to be. I have only recently wrapped my mind around this myself.
|Can this actually happen? I mean it's not going to take a minute or an hour, everything performed within 3 to 5 secs right? Can this occur within that time frame? |
It only takes a microsecond for the server to output a line break. That blank space on your page before the leading <?php can be lethal. In fact what you describe sounds like that is exactly what's happening: You get the beginnings of a page, but then the server realizes that it has been handed a set of bum parameters, and shoves in the content of your 404 page. You need to tweak the code to make sure no html is generated before the server is sure there will be a valid page at the end of the process. One way is output buffering. ("Prepare this stuff, but don't release it until I say so.") There are probably at least six other ways. But that's a php question, which means it will be better answered by almost anyone in the world other than me ;)
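The output-buffering idea can be sketched language-agnostically. Here it is in Python (in PHP the equivalents would be `ob_start()` / `ob_end_clean()`; the page content and paths are made up): everything the page wants to print goes into a buffer, and nothing reaches the client until the parameters are known to be good.

```python
import io

def render_page(path, valid_paths):
    """Return (status, html). Markup is accumulated in a buffer --
    "prepare this stuff, but don't release it until I say so" --
    so a late 404 decision can still throw it all away."""
    buf = io.StringIO()
    buf.write("<html><head><title>Widgets</title></head><body>")
    # ... the CMS keeps writing markup into the buffer here ...
    if path not in valid_paths:
        buf.close()  # discard everything buffered so far
        return 404, "<h1>Error - Page Not Found</h1>"
    buf.write("</body></html>")
    return 200, buf.getvalue()  # safe to release: the status is settled
```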
|How does Google actually discovers that requesting anything returns in a 200 for a website? I mean it only follows the internal links through its crawling process right? |
I don't know what triggers a Garbage Request. But now and then in logs I'll find something like "/paintings/rats/qrkltejtl.html" which is obviously not a typo or garbled link. The search engine is testing whether your site returns a 404 when there can't possibly be a page matching the request. There are two kinds of bad response. One is the global redirect to the front page ("soft 404"); the other is the one you're getting.
| 9:19 am on May 6, 2013 (gmt 0)|
|It doesn't matter what response your server sends out. What matters is what response the user receives |
Don't the users always receive what the server sends out? I am a bit confused. And by saying user, do you mean the browsers?
One more newbie question. How can I see those garbage requests? I mean how do I access the log files?
Again, thanks for all the ideas.
| 4:30 pm on May 6, 2013 (gmt 0)|
|Don't the users always receive what the server sends out? I am bit confused. And by saying user, do you mean the browsers? |
Well, "user-agent" if you want to be formal about it. So either the human user, via their browser, or the robot.
If you have static html pages, then what the server says is what the human sees:
user requests page
server finds page
server sends out 200 response plus page
site logs record 200
user-agent receives page and response
human sees page
But if pages are constructed on the fly via php or similar, there's an intermediate step:
user requests page
server finds code (probably in htaccess) rewriting request to php page
server finds php page
server sends out 200 response and gets to work on page
site logs record 200
server EITHER successfully prepares page and sends it out,
OR fails to prepare page
In the second case, your php page has the further option of replacing the original 200 header with something else, such as 404 or 301. But it can only do this if it hasn't already started constructing the page.
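A toy model of that last constraint (this is not real server code; the class and its names are invented purely to illustrate): the provisional 200 can be swapped for a 404 or 301 right up until the first body bytes are flushed, and not a moment after.

```python
class Response:
    """Toy model: the 200 is provisional until output starts."""
    def __init__(self):
        self.status = 200     # server's provisional header
        self.flushed = False

    def write(self, chunk):
        self.flushed = True   # body bytes are on the wire now

    def set_status(self, code):
        if self.flushed:
            # Too late -- the status line already went out first.
            raise RuntimeError("headers already sent")
        self.status = code
```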
|One more newbie question. How can I see those garbage requests? I mean how do I access the log files? |
That depends on your host. If you're not allowed to see raw logs, change hosts. Logs aren't stored in the same physical directory as your site files, so the server administrator-- your host-- has to do a tiny bit of extra work to break up logs by domain and let users see them.
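Once you do have raw logs, spotting the garbage requests is just a matter of scanning for unknown paths that nevertheless got a 200. A rough sketch, assuming the common Apache-style combined log format; `known_paths` is a made-up stand-in for your real URL list:

```python
import re

# Pulls the request path and status code out of a combined-format log line.
LINE = re.compile(r'"(?:GET|HEAD|POST) (\S+) [^"]*" (\d{3})')

def suspicious_hits(log_lines, known_paths):
    """Yield (path, status) for requests to unknown paths that
    came back 200 -- the pattern discussed in this thread."""
    for line in log_lines:
        m = LINE.search(line)
        if m and m.group(1) not in known_paths and m.group(2) == "200":
            yield m.group(1), int(m.group(2))
```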