Google WMT reports custom 404 page as a soft-404
Angonasec




msg:4312529
 10:27 pm on May 14, 2011 (gmt 0)

Virtual server site, all flat-file, static html.

For years I've had a simple 662-byte custom 404 page working fine with no problems, no funny business whatsoever. No meta refreshes, just a search box and a few words.

Type a duff url for our site, and up pops the custom /404.html page every time.

Today, in our Google Webmaster Tools console, I noticed my first ever "Soft 404" (meaning a page that returns a 200 server response instead of a genuine 404 Not Found response). Just the one.

The page Google is showing in crawl errors as a "soft 404" is my custom /404.html page, thus:

Crawl errors: Soft-404
www.mysite.tld/404.html 404-like content May 11, 2011

The 404.html page is of course NOT linked-to anywhere on my site, and it has always had a meta name="robots" content="noindex, noarchive, nofollow" tag to prevent spiders including it.

In my root .htaccess file there's always been the directive:

ErrorDocument 404 /404.html

Additionally, I've always disallowed all bots, via robots.txt, from /404.html
User-agent: *
Disallow: /404.html

So Googlebot should never have crawled that page +directly+, but it did; here are the relevant log entries:

66.249.72.74 - - [11/May/2011:23:42:59 -0400] "GET /robots.txt HTTP/1.1" 200 1221 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.72.74 - - [11/May/2011:23:42:59 -0400] "GET /404.html HTTP/1.1" 200 662 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


It seems Google now expects the true URL of a custom 404 page to return a 404 response.

What lunacy is this?

 

aakk9999




msg:4312632
 11:38 am on May 15, 2011 (gmt 0)

Type a duff url for our site, and up pops the custom /404.html page every time.


Have you checked headers to confirm that this returns HTTP response 404 rather than 200?

E.g. a request for /invalid-url should directly return a 404 with the content of 404.html, rather than being redirected to /404.html, which then serves the 404.

Because if you return a 404 directly, as a result of a request to a non-existent URL, with the content of 404.html, then I have no idea how G. could even find out that 404.html exists.

g1smd




msg:4312633
 11:45 am on May 15, 2011 (gmt 0)

The usual problem is sites that use

ErrorDocument 404 http://www.example.com/404.html

which returns a 302 redirect as mentioned in the Apache documentation.

However, your code example rules that option out.

I don't know the answer (but post the above answer for all those people who don't know they are serving a 302 redirect for pages that are not found).
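
For anyone reading along, a minimal side-by-side sketch of the two forms (example.com and the filename are just placeholders):

# Wrong: a full URL makes Apache issue a 302 redirect to the error page,
# which is then typically served with a 200 OK status
ErrorDocument 404 http://www.example.com/404.html

# Right: a local path makes Apache serve the error content at the
# originally requested URL with a genuine 404 status
ErrorDocument 404 /404.html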

maximillianos




msg:4312652
 1:32 pm on May 15, 2011 (gmt 0)

My guess is the same as aakk9999's. Make sure you're returning a proper 404 header. You can test this using any number of online tools; Google "http header checker".

Robert Charlton




msg:4312763
 7:27 pm on May 15, 2011 (gmt 0)

Type a duff url for our site, and up pops the custom /404.html page every time.

Definitely use a header checker.

It helps also to note the expected behavior. If you type in http://www.example.com/duffurl.html, you should not expect to see http://www.example.com/404.html in your browser's address bar.

The address bar should show http://www.example.com/duffurl.html, and the page returned for that url should be a 404 page or custom 404 page, with a 404 header response.

If you're getting http://www.example.com/404.html in the address bar, with a 302 or a 200 response, then things are set up incorrectly.
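
One quick way to check from the command line, assuming curl is available (any online header checker will show the same thing):

# the status line should read "404 Not Found", not 200 and not 302
curl -I http://www.example.com/duffurl.html

# requesting the error file directly should read "200 OK" (the file does exist)
curl -I http://www.example.com/404.html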

tedster




msg:4312831
 11:48 pm on May 15, 2011 (gmt 0)

And let me add that Windows IIS server admins (especially MS-trained ones) have often been the ones who set things up like that. In fact, about five years ago Microsoft-approved manuals actually recommended this incorrect approach for a custom 404 error page.

Angonasec




msg:4312843
 12:48 am on May 16, 2011 (gmt 0)

Thank you for your replies.

I ran all the tests you each suggested and found nothing wrong with our Apache server setup.

Mr. Swain's header checker shows our server returns a true 404 response for ourdomain.tld/nosuchpage.html

And of course a correct 200 response for ourdomain.tld/404.html

I also double-checked that any duff URL on our domain shows our custom 404.html page, and that the duff URL remains in the browser address bar.

All this is perfectly normal and according to Apache defaults, so why is Google showing our custom 404.html page as a Crawl error in WMT?

Sheer lunacy.

aakk9999




msg:4312882
 4:24 am on May 16, 2011 (gmt 0)

Hmm... maybe the URL was somehow exposed before, or Google just took a shot in the dark by requesting 404.html (a common name for such a page), got a 200 OK, and so keeps requesting it.

You could try renaming your 404.html to (for example) page-request-error-404.html and setting that up as the ErrorDocument. Then a G. request for 404.html would return a 404 (since 404.html would no longer exist), and you would be serving the body of the 404 error response from a new filename not known to Google.
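
A sketch of what that would look like in .htaccess, using the example filename above:

ErrorDocument 404 /page-request-error-404.html

A direct hit on the old /404.html then gets a genuine 404 (the file no longer exists), while duff URLs still display the custom error content at the requested URL.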

Angonasec




msg:4313372
 11:24 pm on May 16, 2011 (gmt 0)

aakk9999
Q/
...then I have no idea how G. could even find 404.html exists.
/Q

Googlebot knew of the existence of /404.html because, as noted in my original post, it is listed in my robots.txt in order to Disallow robots from crawling the page and filling their index with useless pages.

I've always disallowed all bots, via robots.txt, from /404.html
User-agent: *
Disallow: /404.html

Googlebot broke the rule, and listed a spurious "error" in WMT because of it.

(If I made a new file name for our custom 404 page, I'd still list it in robots.txt, for the same reason we list the current one, therefore the new url would also soon be designated as a "soft-404" by Googlebot. Sheer madness.)

aakk9999




msg:4313402
 12:27 am on May 17, 2011 (gmt 0)

Googlebot knew of the existence of /404.html because, as noted in my original post, it is listed in my robots.txt in order to Disallow robots from crawling the page and filling their index with useless pages.

Sorry, I missed that!

If I made a new file name for our custom 404 page, I'd still list it in robots.txt, for the same reason we list the current one

With regard to the above: if you have a file on the server with an obscure name that you only use to serve the 404, and you make sure you never type its URL into the address bar (hence removing the chance of exposing it via the G. toolbar), then how would anyone know the page physically exists?

Therefore there would be no need to list it in the robots.txt as doing so exposes its filename.

In that case meta robots noindex in that file should be enough as a fallback.

As to why Google crawled it: there have been other reports of Google crawling pages disallowed via robots.txt, and I have also seen a few examples on the sites I look after.

One thing to check is whether you have a separate Googlebot section in your robots.txt, and if you do, that you repeat there everything under User-agent: * that you do not want Google to crawl. There was an earlier thread where it was said that if there is a general user-agent section and a Google-specific section, Google will only follow the Google-specific one and ignore all entries listed under *.

However, in my case there was only a User-agent: * section, yet Google still crawled pages listed there as disallowed.

Angonasec




msg:4313414
 12:52 am on May 17, 2011 (gmt 0)

Thanks for your helpful reply aakk9999.

I'll give some thought to your suggestions.

This may be relevant...

Q/
One thing to check is whether you have a separate Googlebot section in your robots.txt, and if you do, that you repeat there everything under User-agent: * that you do not want Google to crawl. There was an earlier thread where it was said that if there is a general user-agent section and a Google-specific section, Google will only follow the Google-specific one and ignore all entries listed under *.
/Q

In our case Googlebot certainly ignored the universal robots.txt directive.

g1smd




msg:4313422
 1:18 am on May 17, 2011 (gmt 0)

Yes, to be absolutely clear, given this robots.txt file:

User-agent: *
Disallow: /folder1
Disallow: /folder2


User-agent: GoogleBot
Disallow: /folder3


Google will spider /folder1 and /folder2 because Google reads only the User-agent: GoogleBot directive.

In your case, I would use a more obscure name for the file and add the meta robots noindex tag to it. The 404 page would also link out to significant sections and pages of the site.

Angonasec




msg:4313818
 8:43 pm on May 17, 2011 (gmt 0)

Thanks g1smd.

I find it astonishing that Googlebot ignores the universal UA Disallow when it finds a subsequent UA rule for itself!

If that is really so, it could have happened in my case.

Rather than rename my 404.html file, I'll simply repeat the Disallow: /404.html rule in my Googlebot-specific robots.txt section and see if it obeys that and drops the Soft-404 crawl "error" in GWMT.

I tested the new robots.txt format in GWMT using their tool, and the tool gave the desired result, i.e. Gbot couldn't crawl 404.html because of the duplicated rule. So we will see what happens with the live robots.txt.

I'm loath to fiddle with the 404.html file name because it may confuse other bots; after all, there's nothing wrong with how it's set up on our site. The problem is Gbot ignoring the universal Disallow rule.
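
For reference, the duplicated section would look something like this (a sketch only; the real file has more rules):

User-agent: *
Disallow: /404.html

User-agent: Googlebot
Disallow: /404.html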

Angonasec




msg:4313820
 8:49 pm on May 17, 2011 (gmt 0)

I wonder if Bingbot and Y! bots +also+ ignore a Universal Disallow when +they+ find a specific UA rule for them in robots.txt?

suggy




msg:4313829
 9:14 pm on May 17, 2011 (gmt 0)

Just to confirm: if Googlebot finds a specific rule for itself, it then ignores ALL general * rules (dumb, I know).

I know this for a fact, because it's the very reason Google suddenly started indexing my shopping cart etc. pages this month; I made the 'mistake' (yeah, sure, it's my error!) of adding a specific rule for Googlebot to robots.txt last month, in response to Panda.

What are Google like?!

g1smd




msg:4313837
 9:27 pm on May 17, 2011 (gmt 0)

There's a long thread from several years ago where both Matt Cutts and Vanessa Fox confirmed the behaviour I described above.


I'm loath to fiddle with the 404.html file name

Since the error message should be served at the originally requested URL, the actual internal filename of the error file shouldn't be known to external agents. There will be no issue in changing the filename. Don't mention it in your robots.txt file; add the meta robots noindex tag to the page itself.

incrediBILL




msg:4313841
 9:37 pm on May 17, 2011 (gmt 0)

It seems Google now expect the true url of a custom 404 page to return a 404 response.

What lunacy is this?


It's actually 100% accurate.

If the requested URL returns a 404 status along with the 404.html content, it's a hard 404.

If the requested URL returns a 200 status with the 404.html content, that is considered a soft 404, and it's likely a combination of the page and its content that they evaluate, not just the URL, since other kinds of pages are also flagged as soft 404s.

For instance, if a failed URL always redirects to your home page by default, that is also considered a soft 404. Pages stating they are under construction will show up as 404-like, and so on.

I could go on and on about soft 404s, a specialty of mine, but the report is 100% dead on the money, properly reporting what it should be reporting.

Besides, it's not hurting anything being in the WMT report, ignore it.
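
Roughly, the difference at the header level (illustrative status lines only):

Hard 404 - the requested URL answers with a 404 status plus the error content:
GET /no-such-page.html  ->  HTTP/1.1 404 Not Found

Soft 404 - the same "page not found" content, but delivered with a success status:
GET /no-such-page.html  ->  HTTP/1.1 200 OK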

pierrefar




msg:4314055
 12:28 pm on May 18, 2011 (gmt 0)

Hi Angonasec,

I just want to confirm two points made in this thread: if Googlebot sees what looks like an error page but has an HTTP status of 200 (i.e. success), it is called a soft 404. As per the HTTP specification, if a page is not found, it should return a 404 header. It is generally OK to have a 302 redirect to the 404 page.

Regardless, it's not advisable to block Googlebot from the 404.html page as otherwise Googlebot won't discover it and that may negatively impact your website's indexing and user experience.

Also as per the robots.txt specification, each bot that obeys the robots exclusion protocol will first look for a specific group of directives that apply only to it. If none are found then the broad-matching group is what gets applied. This allows you to be very flexible in blocking/allowing the different Googlebots (e.g. Googlebot-Mobile vs Googlebot vs bots from other search engines) for different sections of your website. To give one simple example, an AdWords advertiser may prefer that Googlebot not index their ad landing pages and allow only Google AdsBot through. By following any specific directives first, the advertiser can do so.
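
To illustrate that last example, a robots.txt along these lines (a sketch; /landing-pages/ is a made-up path) keeps ordinary crawlers out while still letting AdsBot through, because AdsBot-Google follows only the group addressed to it:

User-agent: *
Disallow: /landing-pages/

User-agent: AdsBot-Google
Disallow: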

g1smd




msg:4314065
 12:44 pm on May 18, 2011 (gmt 0)

It is generally OK to have a 302 redirect to the 404 page.

No. The originally requested URL (where that page is not found) should directly return the 404 status. There should be no "redirect".

pierrefar




msg:4314106
 2:11 pm on May 18, 2011 (gmt 0)

Yes, a 302 redirect is not ideal and is better avoided if possible. However, it is a common enough situation that Googlebot tries to handle, usually successfully.

tedster




msg:4314112
 2:24 pm on May 18, 2011 (gmt 0)

It is generally OK to have a 302 redirect to the 404 page

The 100% correct approach, technically, is exactly as g1smd says.

However, many sites that are not on Apache servers have a challenge (or a wrongly educated admin.) As long as the 404 page also shows a 404 status in the server header, then Google has been dealing with 302 > 404 situations pretty well.

Without the 404 status, however, the risk of piling up a huge number of "duplicate content" URLs is very real - although Google seems to catch that situation a bit better in recent times. That's one of the main reasons that the soft 404 warning message is now being generated.

It's like this in other areas, too, such as case sensitivity. Google knows that Microsoft servers are like the village idiot, so they do work to accommodate their non-standard behaviors. They have to, or they risk not representing many enterprise websites well in the Google SERPs.

And on the Microsoft side of things, recent IIS server versions are a bit more flexible at allowing the savvy admin to follow web standards.

Samizdata




msg:4314155
 2:58 pm on May 18, 2011 (gmt 0)

662 byte custom 404 page

Some browsers reportedly have problems with custom error pages that are very small, consequently ignoring the custom error and getting the server default instead.

I have seen 512 bytes cited as a minimum, and while you exceed that (and no browser is involved) this may be an area worth investigating - adding some more content to the error page might possibly make a difference.

Google Webmaster Tools reporting is notoriously idiosyncratic.

/ clutching straw

...

g1smd




msg:4314169
 3:12 pm on May 18, 2011 (gmt 0)

Some browsers reportedly have problems with custom error pages that are very small, consequently ignoring the custom error and getting the server default instead.

Since it is the server that sends the error message that it chooses to send, there is no way for the browser to "get" the "server default", whatever that is.

For returned error files under 512 bytes, most versions of IE show a page generated by the browser itself, with the content sent by the server completely ignored.

Some ISPs also detect when a user is about to be served a 404 page and show their own page with adverts on it.

Samizdata




msg:4314199
 3:48 pm on May 18, 2011 (gmt 0)

the "server default", whatever that is

I happily bow to your greater knowledge, but I have occasionally seen in my shared hosting environment an overall server default error instead of my own custom one.

As indicated, I was clutching at a straw, and I will now drown ignominiously.

...

aakk9999




msg:4314387
 10:44 pm on May 18, 2011 (gmt 0)

I happily bow to your greater knowledge, but I have occasionally seen in my shared hosting environment an overall server default error instead of my own custom one.


I have seen this happen on IIS when a custom rewrite/redirect script is used and the server is set up not to forward all requests to that script. Requests not forwarded to the script, where the URL was not found, would serve one type of 404 page, whereas the custom script may serve a different 404 if it cannot resolve the URL.

tedster




msg:4314403
 11:08 pm on May 18, 2011 (gmt 0)

Yes - for example, when a section of a site is developed with JSP running on Tomcat. A request for a .jsp URL that can't be found will receive different handling than a basic request at the .NET level.

g1smd




msg:4314404
 11:08 pm on May 18, 2011 (gmt 0)

Yes, there could be two (or more) different error messages that could be delivered by the server.

One would be the standard server default error message when the file isn't found in the filesystem, and the other could be one that is delivered by a CMS when it has no record in the database. In the latter case, the request has already resolved to the script file so there is no server error involved.

I said there could be more. Simply put, .htaccess is a per-directory configuration file, so it is entirely possible to serve different error files for the same type of error when it happens for URLs in different folders.

Angonasec




msg:4314521
 7:01 am on May 19, 2011 (gmt 0)

Thank you for your illuminating replies, they will help others besides me.

Notwithstanding the patent illogicality (to everyone but a geek) of the matter, I've taken g1smd's advice: switched the name of my 404.html file to a secret new name, removed it from my robots.txt rules, and now rely on just the file's meta noindex to keep it out of the SEs.

Additionally, because all bots apparently ignore the universal robots.txt rules when they find a UA rule specific to them, I've laboriously repeated all my universal rules for each bot allowed by my robots.txt.

What was a trim 800 character file has now become a blubbery 5,200 characters. Such is the glorious inefficiency the big three SEs demand!

I now expect the "soft 404" crawl "error" to disappear from WMT at short notice, never to reappear.

Ta!

g1smd




msg:4314570
 9:43 am on May 19, 2011 (gmt 0)

While it may seem illogical to begin with, what should happen is this: the 404 status code should be returned in the HTTP header for the URL that you asked for, and then the content of the 404 file is included after that header, without changing the URL that you see.

This is in much the same way that when you ask for example.com/folder/ you get to see the content found inside the index.html file without seeing "/index.html" in the URL in the browser address bar.

Where it usually goes wrong is that someone puts ErrorDocument 404 http://www.example.com/404.html in the server configuration file. If you do that, asking for example.com/not-exist results in a 302 redirect to example.com/404.html, which is then served with a 200 OK status code. This is documented in the Apache manual. The redirect is the main problem here, but following it with a 200 OK status doesn't help matters at all.

You can partially fix it by ensuring that the 404.html file itself returns a 404 status code in the HTTP header, but the 100% correct fix is to not invoke the 302 redirect at all, by using ErrorDocument 404 /404.html or whatever filename you choose.

I can only assume that at some time in the past, or under some rare condition, your server does do the redirect thing.

You'll know that is still happening if Google manages to find the new real name for your error file; however, the meta robots noindex directive in the 404 file should also stop that happening.

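Pulling the whole recommended setup together, something like this (obscure-error-page.html is just a made-up name):

In .htaccess:
ErrorDocument 404 /obscure-error-page.html

In the head of obscure-error-page.html:
<meta name="robots" content="noindex">

And no mention of the file in robots.txt, so its name is never exposed.
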
Angonasec




msg:4315543
 10:53 pm on May 20, 2011 (gmt 0)

g1smd:
"I can only assume that at some time in the past, or under some rare condition, your server does do the redirect thing."

Surely the simpler explanation of what happened in my case is that Googlebot simply checked the 404.html file listed in my robots.txt, despite being Disallowed from it there. (Because I had a Gbot-specific UA rule there too.)

Anyway, I've adjusted everything according to your advice, and today I noticed Gbot got a hard 404 on the now-extinct 404.html:

66.249.71.196 - - [19/May/2011] "GET /404.html HTTP/1.1" 404 662 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

So let's see what happens to the GWMT "Soft-404" legacy; it's still there today...
