|What exactly do those WMT "Crawl errors" mean?|
I posted before [webmasterworld.com] about a sudden influx of 404 errors in WMT. But, as I dig further into this, I'm finding that these 404s may only be a part of my problem, so I decided to start a new thread and look at this from a different angle.
Before I get to the point, I wanted to add another bit of info: I'm talking about a site that had lost 50% of its Google referrals. The troubles started on or around July 26th, so I'm not too certain if I can pin it on Panda - seems a couple of days off.
So, I'm seeing a considerable number of other types of errors as well, which makes me think that there's something going on at the technical level (before I blame it all on Panda and move on, as people here suggested) :)
Here is the list of errors. Some are rather self-explanatory but some are ambiguous, so I would appreciate it if you can fill in based on your previous experience:
404 (Not found) which I hear could mean not only the obvious 404 but also 410 and even 301. Does not sound like a big stretch to me to think that anything in the 400s (except 403) would fall (totally inappropriately if you ask me) into this category
Connection refused - would this be a 403? I can't confirm because I don't see a hit by Googlebot on the date shown on the URL reported with a 403 status code. There are normally just no hits on that URL at all. Where they get "refused" in this case remains a mystery to me
No response - server down? Strangely, it shows dates when I know for certain the server was up 100%
Failed to connect - physical node up but Apache is down? Just a speculation, otherwise I see no difference between it and "no response". Again, the date shown is the day I know it was up 100% of the time
Network unreachable - DNS and possibly the provider's routing errors and such? This makes sense because I'm finding out that one of my nameservers has been down for a l-o-o-ong time (quite possibly over a year) and I think it can cause DNS problems intermittently if the whole server is only supported by one nameserver.
Redirect error - usually cyclic redirects. All instances of this error seem to be reported correctly, so no quibbles here.
500 error - usually a programming error. Seems to be reported correctly all the time.
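For anyone who wants to reproduce these categories from their own machine, here is a minimal sketch of how the list above might map onto the failures a plain Python client sees. The labels and the mapping are my guesses, not Google's definitions, and `probe` and `classify_failure` are hypothetical helper names.

```python
import errno
import socket
import urllib.error
import urllib.request


def classify_failure(exc):
    """Map a Python network exception to a rough WMT-style label (guesswork)."""
    # An HTTPError means the server did answer, just with an error status.
    if isinstance(exc, urllib.error.HTTPError):
        return "HTTP %d" % exc.code
    # urllib wraps lower-level errors in URLError; unwrap to the real cause.
    if isinstance(exc, urllib.error.URLError) and isinstance(exc.reason, BaseException):
        exc = exc.reason
    if isinstance(exc, socket.gaierror):
        return "DNS lookup failed"
    if isinstance(exc, ConnectionRefusedError):
        return "connection refused"
    if isinstance(exc, (socket.timeout, TimeoutError)):
        return "no response (timed out)"
    if isinstance(exc, OSError) and exc.errno in (errno.ENETUNREACH, errno.EHOSTUNREACH):
        return "network unreachable"
    if isinstance(exc, OSError):
        return "failed to connect (other socket error)"
    return "unknown"


def probe(url, timeout=10):
    """Fetch url once and describe the outcome as a short label."""
    try:
        # urlopen follows redirects, so this reports the final status only.
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return "HTTP %d" % resp.status
    except Exception as exc:
        return classify_failure(exc)
```

Calling `probe("http://www.example.com/some-page")` on a URL that WMT flags, from a few different networks, can at least show which layer fails for you. It cannot show what Googlebot saw at the time.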
So, that's the list. I see A LOT of 404, connection refused, failed to connect and network unreachable errors, otherwise I wouldn't have started this thread. So I can't just dismiss it as a fluke. Also, date reporting in WMT seems to be out of whack - it shows no errors on the days I know the server had issues, and errors on the day I was watching it all the time and saw nothing wrong.
Other than perhaps 500 and Redirect errors, there seems to be quite a bit of ambiguity about what the errors actually mean, so I would appreciate it if people could fill in better descriptions if they have had to deal with this before.
P.S. I did try WMT Help first. Google employees' responses are usually as ambiguous as the error details in WMT ...
Are your "connection refused" and "network unreachable" errors mostly on August 4th? I suddenly see a number of those on a site previously without problems.
There have been a number of occasions reported here in the past where Google could not reach a site and it was later found that the bottleneck was somewhere at Google's end.
404 (Not found) - It is concerning that URLs returning 410 and 301 are reported as returning 404. I don't know why Google does this.
Connection refused - Not sure if this is 403, or some other failure at the HTTP level.
The rest of your designations seem to be pretty much spot on.
I wouldn't say specifically on the 4th. I see a lot of them dated anywhere between 3rd and 11th. In fact, 8th looks more like the epicenter to me.
|Are your "connection refused" and "network unreachable" errors mostly on August 4th? I suddenly see a number of those on a site previously without problems. |
I see a lot of "Network unreachable" and "robots.txt unreachable" errors on 2011 August 4th and several "Connection refused" errors on both the 10th and 11th.
Having the exact time of these events would be very useful, but that data isn't forthcoming.
1script - you say one of your DNS servers failed. Could that be a lot (not all) of the trouble?
If Google failed to route to your working DNS server and couldn't find an alternative, it could have returned errors; whereas, because YOU could get your site via that DNS (probably cached locally by you), you considered it live. As, quite probably, did a lot of other people who could route through that DNS server.
This would particularly explain why your logs have no entries for certain times that google claims you had errors.
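To confirm that the logs really have no Googlebot entries on a reported date, something like the following rough sketch could help. It assumes the Apache "combined" log format, and `googlebot_hits` is a hypothetical helper name. The day in the log is in the server's timezone, which may not match the date WMT shows, and the user-agent string can be spoofed.

```python
import re
from collections import Counter

# Assumes Apache "combined" format:
# ip - - [04/Aug/2011:10:15:32 +0000] "GET /path HTTP/1.1" 404 512 "referer" "agent"
LOG_LINE = re.compile(
    r'\S+ \S+ \S+ \[(?P<day>[^:]+):[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_hits(lines, day):
    """Count Googlebot requests per (path, status) on one day, e.g. '04/Aug/2011'."""
    hits = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        # Lines that do not match the assumed format are silently skipped.
        if m and m.group("day") == day and "Googlebot" in m.group("agent"):
            hits[(m.group("path"), int(m.group("status")))] += 1
    return hits
```

Usage would be along the lines of `with open("access.log") as f: print(googlebot_hits(f, "04/Aug/2011"))`, where the log path is whatever your server uses. An empty result for a day on which WMT reports errors suggests the requests never reached Apache.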
It would also be worth checking that there are no problems with your working DNS server (and presumably, now, your newly working secondary one).
In the UK I use intodns.com to check DNS faults. It would be worth finding US-based tracers as well.
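Besides online checkers, each nameserver can be queried directly to see which ones actually answer. Below is a minimal sketch using only the Python standard library. It hand-builds a basic DNS A-record query over UDP, and `build_query` and `nameserver_answers` are hypothetical helper names. It says nothing about how Googlebot itself resolves names.

```python
import socket
import struct


def build_query(hostname, query_id=0x1234):
    """Build a minimal DNS query packet asking for the A record of hostname."""
    # Header: id, flags (recursion desired), 1 question, 0 other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question name: each label prefixed by its length, ending with a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.strip(".").split(".")
    ) + b"\x00"
    # QTYPE 1 = A record, QCLASS 1 = IN.
    return header + qname + struct.pack("!HH", 1, 1)


def nameserver_answers(ns_ip, hostname, timeout=3.0):
    """Return True if the nameserver at ns_ip replies to the query without error."""
    query = build_query(hostname)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (ns_ip, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:  # includes timeouts and unreachable networks
            return False
    if len(reply) < 12:
        return False
    reply_id, flags = struct.unpack("!HH", reply[:4])
    # The low 4 bits of the flags are the RCODE; 0 means "no error".
    return reply_id == 0x1234 and (flags & 0x000F) == 0
```

Running `nameserver_answers(ip, "www.example.com")` for the IP address of each listed nameserver (ns1, ns2 and so on) shows which of them respond. A nameserver that has been dead for a long time should return False every time.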
I think you're spot on! I didn't think about the possibility that they could be using (trying to, anyhow) a different nameserver, maybe even intentionally so as to, for example, be nice and not overload the first - ns1.example.com. I noticed a long time ago that Google is always the first to start visiting a site on its new IP address when it gets changed. My guess is that they don't cache DNS queries for long (or at all?) and keep polling very often. That would make them more responsive to DNS changes (and a little more susceptible to DNS poisoning? but I digress).
Sadly, my ns2.example.com hasn't been fixed yet - its IP got lost in a shuffle (long story) - but at least I know it's not a low priority task and will keep nagging the tech support until it's fixed.
We see a lot of unexpected "Network unreachable" errors on August 6th.
If you still had reasonable levels of traffic from real visitors, the problem was more likely at, or closer to, Google's end.
I just looked; I have a ton of "network unreachable" errors for two sites, August 10-12, and a handful of "robots.txt unreachable" errors for the same dates. I see no dip in my stats for those two sites, plus they're hosted (with a third, larger site) on extremely reliable hosting - plus the third site doesn't report any errors at all.
I think it's gotta be Google. Also seeing lots of 404 errors for stuff that appears to come up just fine.
1script - were it my SE I would only cache for a short-ish time, possibly for the length of a full scan. If, on the next scan day, they retried the domain name lookup and your working DNS server was unreachable for some reason, they would probably ignore you for a day or so?
The same may not be true of a human, who may well hit the Try Again button on the browser and possibly get lucky, depending on the reason for the response failure and the actual delay involved.
You could always add a third/fourth DNS server to your domain regardless of your second one not working. This is fairly easy to do (there are third party DNS servers available for free or small fee) if you have access to the domain's setup. If you have to rely on your hosting company for this then ask them to do it.
Anyone seeing discrepancy in "Links to your site"? I am seeing it reporting almost no links for some sites. I don't know whether it is related to these crawl errors.
Regarding the 404 errors, this is yet another mystery that has been around this year. Most of them pertain to URLs that were removed long ago or never existed on your site at all.
A couple of years ago Matt Cutts had this in his post - [mattcutts.com...]
|Why would you care about this? The simple reason is that if someone is linking to a non-existent page on your site, it can be a bad experience for users (not to mention that you might not be getting credit for that link with search engines unless you’re doing extra work). |
But I think that Google is now asking us to ignore them and saying that 404s won't hurt. It is all confusing and changes frequently.
[edited by: indyank at 6:13 am (utc) on Aug 15, 2011]
I think Matt Cutts' blog should be pandalized (has it been or not?), as it has a lot of outdated information that is not useful for users. In fact, some of it would be misleading.
Here's how I understand it.
If you see a 404 link coming from a good site, then finding a way to redirect it can give you credit for that backlink and give any traffic it generates a good experience. However, ignoring a 404 coming from another site will not actually HURT you, it just might be a missed opportunity.
Of course, lots of 404 links coming from your own site would be a very negative signal.