
Google SEO News and Discussion Forum

    
SEO help with missing 404 server responses
shaunm
msg:4569977
 10:21 am on May 2, 2013 (gmt 0)

Hi,

I know this is a deep topic, but I would much appreciate your brief help with it.

What will happen if the CMS and the server are not set up to return a 404 header response when a non-existent page is requested? I know this is easy to configure at the server level, but what will happen if someone forgets to do so?

I see that almost all of the non-existent pages on my website return a 200 response when they should return a 404.

Thanks for all your help.

 

netmeg
msg:4570016
 12:27 pm on May 2, 2013 (gmt 0)

You'll get a message in your GWT about "soft" 404s, for one thing.

shaunm
msg:4570020
 12:43 pm on May 2, 2013 (gmt 0)

Thanks, Netmeg. So are you sure that the GWT soft 404 errors are triggered by this kind of thing?

Will it have a negative impact on the site? I mean, when a page is deleted it will still return a 200, so Google will continue to crawl that empty error page and index it, right?

netmeg
msg:4570081
 4:39 pm on May 2, 2013 (gmt 0)

Eventually if you get enough of them, it probably will have some negative effect. For one thing, it may well goof up your crawl budget, and you'll find Google coming back to pick up your real pages less often.

If you remove a page, you should serve a real 404 or a 410. If your CMS can't handle it, then either fix it or change to another CMS. Which of course is a PITA but them's the choices if you care at all about search traffic.
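
Just to illustrate, in a PHP-based setup the "page removed" branch can be as small as something like this (a rough sketch, not tied to any particular CMS - the include filename is made up):

<?php
// hypothetical "page removed" branch - must run before any HTML is output
header("HTTP/1.1 410 Gone");        // or "404 Not Found" if the URL never existed
include "custom-error-page.php";    // whatever you want visitors to see
exit;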

[support.google.com...]

Robert Charlton
msg:4570136
 8:31 pm on May 2, 2013 (gmt 0)

I see that almost all of the non-existent pages on my website return a 200 response when they should return a 404

shaunm - As I remember, on one question you posted a while back, it turned out that you were running a CMS on IIS. It's unfortunately fairly standard procedure in IIS to set up custom error pages improperly, where the error is 302 redirected to a custom error page that returns a 200 OK.

With this kind of arrangement, certain types of errors can theoretically return an infinite number of 200 responses for non-existent urls. Google never sees the proper 404 response. The problem is so common that Google has figured out a way to detect it, at least to the degree of reporting these as "soft 404s". IMO, it's a problem that should be fixed in your server setup.
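
To illustrate the difference (URLs made up), here's what the crawler sees under the broken setup versus the fixed one:

Broken:  GET /no-such-page.aspx  ->  302 redirect to /error.aspx  ->  200 OK
Fixed:   GET /no-such-page.aspx  ->  404 Not Found, with the custom error content in the body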

On some IIS/.NET CMS systems I've seen (shudder), the documentation and interface check boxes are so confusing, though, that it's truly not clear what's going on, or what header response to expect. (And a further confusion... some of these CMS systems also use dynamically generated canonical tags, thinking that this will address all problems.)

For more detail on at least the tip of the iceberg, take a look at this thread, which might get you started on fixing it...

Custom Error Pages - Beware the Server Header Status Code
http://www.webmasterworld.com/google/3626149.htm [webmasterworld.com]

shaunm
msg:4570321
 10:29 am on May 3, 2013 (gmt 0)

@Netmeg
Thanks again.

@Bob
Thanks for answering my post, and for that useful link :-)

I know it's pretty easy to set up at the server level, but until then I just wanted to know what negative impact it would have on the site.

Do you think a site with fewer than 60 pages should worry about this kind of technical fault? And I like the way you remembered from my previous post that the site is on an IIS server :-)

tedster
msg:4570327
 11:17 am on May 3, 2013 (gmt 0)

I've seen smaller sites get into trouble this way in the past - from two scenarios, one malicious and one innocent. If your site begins to attract attention (i.e. rank well competitively) and your competition notices the vulnerability, then an intentional negative SEO attack can spring up. Otherwise, you can attract automatic linking from various widespread spammer scripts that grab outbound links to try to improve the trust of the spam site.

In either case, I don't consider it worth the risk, given the simplicity of the fix. If there is some obstacle, at the very least use a canonical URL link on every page.
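
For reference, that's the usual link element in the <head> of every page, with a placeholder URL:

<link rel="canonical" href="http://www.example.com/your-real-page/">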

shaunm
msg:4570329
 11:29 am on May 3, 2013 (gmt 0)

Thank you Ted :-)

I am really confused. In what way can such an issue attract spammers? I mean, how can they actually benefit from it?

Say my site is abc.com and it should return a 404 response when a page like abc.com/zzzz is requested by a browser, but in reality it returns a blank page with a 200 server response. How can my competition or spammers make use of that?

tedster
msg:4570442
 7:28 pm on May 3, 2013 (gmt 0)

In what way can such an issue attract spammers? I mean, how can they actually benefit from it?

It's not clear that they can, for real. However, with the release of the Google Webmaster video a few years ago - focusing on the value of credible outbound links - it became "a thing" for spammers to experiment with. Once it's in their formula, it's slow to leave because spamming is often a mass-volume, churn and burn business.

lucy24
msg:4570472
 9:17 pm on May 3, 2013 (gmt 0)

in reality it returns a blank page with a 200 server response

That's potentially several different things.

First: If you're using a CMS, your own server will always record a 200: the request has been successfully handed off to the php script or similar. What matters is the response the user receives.

Second: The blank page may mean that the CMS isn't doing its thing, so bad parameters lead to an empty page accompanied by a 200 header.

Third: In the slightly better alternative, the CMS meant to return a 404 response, but by the time it figured out that the parameters were bad, it had already output some content-- maybe a <head> section, maybe just a humble line break-- so it was no longer able to send the 404.

Fourth: The recipient is receiving a 404. But the CMS isn't supplementing this with the physically visible custom 404 page, so humans see an empty window.

Fifth: ... et cetera.

Your task is to
#1 make sure the CMS doesn't generate any html before it is certain that it's got valid parameters
#2 if parameters are bad, return a 404 response
#3 accompany the 404 response with the display of your custom 404 page. The CMS has to do it; the server can't. (A rough sketch follows below.)
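
Roughly, assuming a PHP-based CMS, the shape of it is something like this (the cms_* function names are invented, just to show the order of operations):

<?php
// hypothetical front controller - nothing has been echoed yet at this point
$id   = isset($_GET['id']) ? $_GET['id'] : '';
$page = cms_load_page($id);            // invented lookup function

if ($page === null) {
    header("HTTP/1.1 404 Not Found");  // #2: send the status while it can still be sent
    include "templates/404.php";       // #3: then the visible custom 404 page
    exit;
}

echo cms_render($page);                // #1 satisfied: HTML only after validation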

Once google discovers that requesting absolutely anything results in a 200, it will go berserk and ask for nonstop garbage. I can always tell when the googlebot is getting suspicious of some directory, because logs show a flurry of requests for /directory/some-random-garbage.html (Oddly, it has never yet become suspicious of directories where I really did forget to code for 404s. But luck does not last forever.)

shaunm
msg:4571022
 7:52 am on May 6, 2013 (gmt 0)

@Ted
Thank you, again :-)

@Lucy24
Thank you! I must admit that was informative for me. And I hope you'll answer my follow-up questions as well :-)

First: If you're using a CMS, your own server will always record a 200:


So, this cannot happen if the website is hosted on some other server, right?

Second: The blank page may mean that the CMS isn't doing its thing, so bad parameters lead to an empty page accompanied by a 200 header.


I am sorry, I was wrong. It's neither a blank page nor a custom 404 error page, but a page with all the top menus and navigation which says 'Error - Page Not Found. You are seeing this error because you tried to access a web page that does not exist' - yet the server response is still 200 when checked with SEOBook's server response tool or any other server response checker.

Third: In the slightly better alternative, the CMS meant to return a 404 response, but by the time it figured out that the parameters were bad, it had already output some content-- maybe a <head> section


Can this actually happen? I mean, it's not going to take a minute or an hour - everything is performed within 3 to 5 seconds, right? Can this occur within that time frame?

maybe just a humble line break-- so it was no longer able to send the 404.


What is a line break?

Final question - and the one I'm most eager about:
Once google discovers that requesting absolutely anything results in a 200


How does Google actually discover that requesting anything returns a 200 for a website? I mean, it only follows internal links through its crawling process, right? And those pages that return a 200 aren't even getting crawled or indexed, for that matter. Then how can Google do it?

Thanks for all your help :-)

lucy24
msg:4571038
 8:34 am on May 6, 2013 (gmt 0)

So, this cannot happen if the website is hosted on some other server, right?

It doesn't matter whose server it is. I meant: the server that the site lives on. The key part is: It doesn't matter what response your server sends out. What matters is what response the user receives. If you've got pure static html, they will be the same. But they don't have to be. I have only recently wrapped my mind around this myself.

Can this actually happen? I mean, it's not going to take a minute or an hour - everything is performed within 3 to 5 seconds, right? Can this occur within that time frame?

It only takes a microsecond for the server to output a line break. That blank space on your page before the leading <?php can be lethal. In fact what you describe sounds like that is exactly what's happening: You get the beginnings of a page, but then the server realizes that it has been handed a set of bum parameters, and shoves in the content of your 404 page. You need to tweak the code to make sure no html is generated before the server is sure there will be a valid page at the end of the process. One way is output buffering. ("Prepare this stuff, but don't release it until I say so.") There are probably at least six other ways. But that's a php question, which means it will be better answered by almost anyone in the world other than me ;)
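
A bare-bones sketch of the output buffering approach, assuming PHP (the cms_* lookup is invented):

<?php
// note: nothing, not even a blank line, may appear before this opening tag
ob_start();                            // buffer all output from here on

$id   = isset($_GET['id']) ? $_GET['id'] : '';
$page = cms_load_page($id);            // invented lookup

if ($page === null) {
    ob_end_clean();                    // throw away anything buffered so far
    header("HTTP/1.1 404 Not Found");  // still possible - nothing has actually gone out
    include "404.php";                 // visible custom 404 page
    exit;
}

echo cms_render($page);
ob_end_flush();                        // valid page: release the buffer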

How does Google actually discover that requesting anything returns a 200 for a website? I mean, it only follows internal links through its crawling process, right?

I don't know what triggers a Garbage Request. But now and then in logs I'll find something like "/paintings/rats/qrkltejtl.html" which is obviously not a typo or garbled link. The search engine is testing whether your site returns a 404 when there can't possibly be a page matching the request. There are two kinds of bad response. One is the global redirect to the front page ("soft 404"); the other is the one you're getting.

shaunm
msg:4571041
 9:19 am on May 6, 2013 (gmt 0)

@Lucy24

Thanks again!

It doesn't matter what response your server sends out. What matters is what response the user receives


Don't users always receive what the server sends out? I am a bit confused. And by saying 'user', do you mean the browser?

One more newbie question. How can I see those garbage requests? I mean how do I access the log files?

Again, thanks for all the ideas.

lucy24
msg:4571180
 4:30 pm on May 6, 2013 (gmt 0)

Don't users always receive what the server sends out? I am a bit confused. And by saying 'user', do you mean the browser?

Well, "user-agent" if you want to be formal about it. So either the human user, via their browser, or the robot.

If you have static html pages, then what the server says is what the human sees:
user requests page
server finds page
server sends out 200 response plus page
site logs record 200
user-agent receives page and response
human sees page

But if pages are constructed on the fly via php or similar, there's an intermediate step:
user requests page
server finds code (probably in htaccess) rewriting request to php page
server finds php page
server sends out 200 response and gets to work on page
site logs record 200
server EITHER successfully prepares page and sends it out,
OR fails to prepare page

In the second case, your php page has the further option of replacing the original 200 header with something else, such as 404 or 301. But it can only do this if it hasn't already started constructing the page.
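
In PHP terms that constraint looks roughly like this - just an illustration of the point, not a complete error handler:

<?php
// somewhere in the CMS, after it discovers the requested page doesn't exist
if (!headers_sent()) {
    header("HTTP/1.1 404 Not Found");  // the provisional 200 can still be replaced
} else {
    // too late: output already went to the user-agent with a 200 status,
    // so all the script can do now is print error text into the page body
}
include "404.php";                     // show the human-readable error page either way
exit;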

One more newbie question. How can I see those garbage requests? I mean how do I access the log files?

That depends on your host. If you're not allowed to see raw logs, change hosts. Logs aren't stored in the same physical directory as your site files, so the server administrator-- your host-- has to do a tiny bit of extra work to break up logs by domain and let users see them.
