Forum Moderators: Robert Charlton & goodroi


Google WMT reports custom 404 page as a soft-404

         

Angonasec

10:27 pm on May 14, 2011 (gmt 0)



Virtual server site, all flat-file, static html.

For years I've had a simple 662-byte custom 404 page working fine with no problems, no funny business whatsoever. No meta refreshes, just a search box and a few words.

Type a duff url for our site, and up pops the custom /404.html page every time.

Today, in our Google Webmaster Tools console, I noticed my first ever "Soft 404" (meaning a page that returns a 200 server response, instead of a genuine 404 page not found server response.) Just the one.
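To make the hard/soft distinction concrete, here is a minimal sketch in Python. The function name and the phrase list are assumptions made up for illustration; how Google actually decides that content is "404-like" is not known from this thread.

```python
# Hypothetical heuristic for illustration only; the phrases below are
# assumptions, not Google's real detection rules.
NOT_FOUND_PHRASES = ("not found", "page cannot be found", "no longer exists")


def classify_response(status: int, body: str) -> str:
    """Label a response as 'hard-404', 'soft-404', or 'ok'."""
    if status == 404:
        # The server itself says the page is missing: a genuine 404.
        return "hard-404"
    text = body.lower()
    if status == 200 and any(phrase in text for phrase in NOT_FOUND_PHRASES):
        # The server says "OK" but the content reads like an error page.
        return "soft-404"
    return "ok"
```

On this reading, the same error-page text counts as a genuine 404 when it arrives with a 404 status, and as a soft 404 when it arrives with a 200.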

The page Google is showing in crawl errors as a "soft 404" is my custom /404.html page, thus:

Crawl errors: Soft-404
www.mysite.tld/404.html 404-like content May 11, 2011

The 404.html page is of course NOT linked to from anywhere on my site, and it has always had a meta name="robots" content="noindex, noarchive, nofollow" tag to stop spiders from indexing it.

In my root .htaccess file there's always been the directive:

ErrorDocument 404 /404.html

Additionally, I've always disallowed all bots, via robots.txt, from /404.html
User-agent: *
Disallow: /404.html
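
For anyone wanting to double-check that a rule like this really blocks the page for a compliant bot, here is a rough sketch using Python's standard urllib.robotparser. It only covers the two lines quoted above (the real robots.txt is larger), and the helper name is_allowed is made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Only the two lines quoted in the post; the full robots.txt is assumed
# to contain other rules that do not affect /404.html.
ROBOTS_TXT = """\
User-agent: *
Disallow: /404.html
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(is_allowed("Googlebot", "http://www.mysite.tld/404.html"))
    print(is_allowed("Googlebot", "http://www.mysite.tld/index.html"))
```

This only tells you what a well-behaved parser makes of the rule, not what any particular crawler actually does with it.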

So Googlebot should never have crawled that page +directly+, but it did. Here are the relevant log entries:

66.249.72.74 - - [11/May/2011:23:42:59 -0400] "GET /robots.txt HTTP/1.1" 200 1221 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.72.74 - - [11/May/2011:23:42:59 -0400] "GET /404.html HTTP/1.1" 200 662 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
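
To find hits like these without reading the raw log by eye, something along these lines may help. It is a sketch that assumes the Apache "combined" log format seen in the two entries above; the function names parse_line and direct_hits are invented for the example.

```python
import re

# Pattern for the Apache "combined" log format, as seen in the entries above.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line: str):
    """Parse one log line into a dict, or return None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry


def direct_hits(lines, path="/404.html"):
    """Return entries where `path` itself was requested and answered with 200."""
    hits = []
    for line in lines:
        entry = parse_line(line)
        if entry and entry["path"] == path and entry["status"] == 200:
            hits.append(entry)
    return hits
```

Feeding the access log through direct_hits lists every request that asked for /404.html by name and got a 200, which is the case WMT is flagging here.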


It seems Google now expect the true url of a custom 404 page to return a 404 response.

What lunacy is this?

g1smd

11:35 pm on May 20, 2011 (gmt 0)




I find it takes about 3~7 days for WMT to update for some reports, others a little more.

Yes, you are right, they may have got the idea from the robots.txt file. I had forgotten you had mentioned that. Noting there was something that is disallowed to other bots but seemingly allowed to Google, maybe they actively pulled the file even though you don't link to it from HTML pages within the site. If that's true, then that's a previously undocumented attack vector - but not surprising given Google's predatory attitude to finding data on the web.