|How does Google react to HTTP error codes?|
I have the bad habit of sometimes making live changes to the htaccess on my site. I had a small issue, and Google hit 3 of my pages and was sent a 403.
Should I expect Google to try again or should I relocate those URLs?
If you search Google for "http error codes", it brings up a W3 page which states:
|10.4.4 403 Forbidden |
The server understood the request, but is refusing to fulfill it. Authorization will not help and the request SHOULD NOT be repeated. If the request method was not HEAD and the server wishes to make public why the request has not been fulfilled, it SHOULD describe the reason for the refusal in the entity. If the server does not wish to make this information available to the client, the status code 404 (Not Found) can be used instead.
As you can see, they emphasize "SHOULD NOT be repeated". It's not a MUST NOT, but ... pretty close?
And if you can, talk about how Google reacts to 410, 500, 404, 302, 301. Will a 410 (Gone) receive requests in the future? What about a 500 (Internal Server Error)? Or a 404 (Not Found)?
And do mention if any of the error codes will prevent google from ever trying to get a page.
I can speak from personal experience on the 500 error. My server had been giving a 500 error instead of a 404 (file not found). Don't ask why.
Anyway, I didn't have a robots.txt on the site. I didn't bother because there was nothing that I didn't want the robots to crawl.
So the bots would turn up, get 500 error on robots.txt and go away. This was stopping my site from being crawled and indexed by Google and Yahoo.
However, they would come back the very next day (or sooner) and try again. After I fixed the problem, I was indexed pretty quickly.
So the answer to 500 server error is:
The bot will leave your site immediately and try again later.
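The robots.txt anecdote above can be sketched as a simple decision rule. This is only my reading of the "go away and try again later" behavior described in the thread, not any engine's published logic, and the thresholds are assumptions:

```python
# Sketch of a polite bot's decision when fetching robots.txt, based on
# the behavior described above: a 5xx on robots.txt means "back off and
# retry later", while a missing robots.txt (404) means crawl freely.
def may_crawl(robots_status):
    """Decide whether to crawl a site given the robots.txt status code."""
    if robots_status == 404:
        return True   # no robots.txt at all: nothing is disallowed
    if 200 <= robots_status < 300:
        return True   # fetched fine; defer to the rules inside the file
    return False      # 5xx (or 403, etc.): play safe, come back later

print(may_crawl(500))  # the "leave immediately, retry later" case
```

With a 500 on robots.txt, the function returns False, which matches why the whole site stopped being crawled until the error was fixed.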
As for 404 file not found:
If there's a link taking the bot to the missing file, it will try to follow the link again on the next crawl.
301: obviously the bot follows the redirect just like users, and treats the new site as the old.
[edited by: callivert at 12:46 am (utc) on Aug. 24, 2007]
I'll give it my best shot:
301 - Google will index the target url and its information
302 - Google will index the original url, but with the target url's information. This is only true within the same domain. If the 302 is on a different domain than its target, then Google will index the target url and the target url's information
403 - Google will try again. They know mistakes happen
404 and 410 - Treated identically, at present. Google will continue to spider the url. Any previous content from those urls may hang around in the Supplemental Index for several months or more, but it will be removed from the regular index rather quickly.
500 - Google will try again.
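The list above reads like a lookup table, so here it is as one. This mapping is just my summary of the poster's claims in code form, not any engine's documented behavior:

```python
# Rough sketch of the crawl policy summarized above. The action strings
# are informal descriptions, and the same_domain distinction for 302 is
# taken directly from the post.
def crawl_action(status, same_domain=True):
    """Map an HTTP status code to the crawler reaction described above."""
    if status == 301:
        return "index target url with target's information"
    if status == 302:
        if same_domain:
            return "index original url with target's information"
        # cross-domain 302 is treated like a 301
        return "index target url with target's information"
    if status in (404, 410):
        return "drop from regular index, keep respidering"
    if status in (403, 500):
        return "retry later"
    return "no special handling"

print(crawl_action(410))  # same as crawl_action(404)
```

Note how 404 and 410 land on the same branch, mirroring the "treated identically, at present" point.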
In general, Google tries over and over to be sure the url is really gone. Googlebot has a huge appetite for urls and doesn't want to miss anything.
Google still hasn't figured out 410-Gone. They want to 'forgive' Webmaster errors, so they keep trying. But they could save an awful lot of wasted effort by trying again in one day, then two days later, then four, then 8, say up to 128 days, or six months or so, and then after a few years call it "really, really gone."
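The doubling schedule proposed above is easy to write down. This is just a sketch of that suggestion, with the cap and attempt count as arbitrary parameters:

```python
# The retry schedule suggested above: wait 1 day, then 2, 4, 8 ... days
# between recrawl attempts, capped at 128 days per interval.
def retry_schedule(max_interval_days=128, total_attempts=10):
    """Yield the waiting interval (in days) before each recrawl attempt."""
    interval = 1
    for _ in range(total_attempts):
        yield interval
        interval = min(interval * 2, max_interval_days)

print(list(retry_schedule()))
# [1, 2, 4, 8, 16, 32, 64, 128, 128, 128]
```

After the cap is reached, each further attempt costs one fetch every 128 days, which is the "awful lot of wasted effort" saved compared with retrying constantly.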
As it is, they'll keep trying no matter what error code you give them.
Reassuring words tedster & jim :)
I think this might have caused my supplemental x-files event on that page I mentioned on a different thread.
I'll keep you updated on how soon it tries to crawl the 403ed pages again.
Otherwise, it takes about 15 minutes to rewrite the entire directory structure of my site, so ... as a last resort ... I'll do it.
[edited by: TheSeoDude at 1:15 am (utc) on Aug. 24, 2007]
Just checked my logs, and one page that 403ed yesterday was fetched again today.
So for all those who funk up once in a while, there's still hope ... unless you 301.
On a 301, will Google ever look again at the page that issued it, or will he just replace it with the new url permanently? That's how he should do it ...
PS: I'm addressing a robot with "he". I'm lost! He might be a she. It would explain the moods.
|I think this might have caused my supplemental x-files event |
Yup - I recently mentioned a client's PR7 page that went supplemental - that turned out to be a redirect that went wrong technically. We'll see how fast things get fixed.
|On 301 will google ever look at the page that issued it |
Yes, and regularly. Since 301 redirects can transfer backlink influence of all kinds, if that redirect goes away, Google wants to know about it.
|Yes, and regularly. Since 301 redirects can transfer backlink influence of all kinds, if that redirect goes away, Google wants to know about it. |
What if the 301 is in-site, not to an external url? Will it be re-checked too? It makes sense to check a redirect periodically if it points somewhere else, but if a site owner marks a page as permanently redirected within the site, will it also be checked again? According to the specs, it should not be.
Internal or external, googlebot keeps checking every url it knows about.
|In general, Google tries over and over to be sure the urlis really gone. Googlebot has a huge appetite for urls and doesn't want to miss anything. |
I've noticed that some CMS software will temporarily lock a page while you're editing it - if Googlebot or somebody comes along and hits that page, they get a 404 until you are done. I think WordPress does this. It's not that big of a deal, but I helped a friend track down why some pages were showing in his logs as 404 even though they weren't, and this was why.
|404 and 410 - Treated identically... |
I'm just back from SES San Jose, and I'd asked about this at the Meet the Crawlers session. It turns out that Google, Yahoo, MSN, and Ask all treat 404s and 410s the same.
|Google still hasn't figured out 410-Gone. They want to 'forgive' Webmaster errors, so they keep trying. |
Yes, this was the reason several engineers gave... that we'd be amazed at the inappropriateness of the headers they see, so treating a 410 as a 404 is, in their eyes, safer.
Not to get into a long discussion about Ask, but Ask recommended that for their engine, instead of a 410, you use a robots.txt disallow to remove a page from the index.
|I've noticed that some CMS software will temporarily lock a page while you're editing it - if Googlebot or somebody comes along and hits that page, they get a 404 until you are done. |
I believe it was Google that suggested you send a 500 or a 503 when you anticipate temporary site problems, and the other engines apparently concurred.
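A 503 for anticipated downtime is usually paired with a Retry-After header telling crawlers when to come back. Here's a minimal sketch of that idea; the port, retry window, and handler name are my own placeholder choices, not anything from the thread:

```python
# Minimal sketch: answer every request with 503 plus Retry-After while
# the site is under planned maintenance, so crawlers know it's temporary.
from http.server import BaseHTTPRequestHandler, HTTPServer

MAINTENANCE = True
RETRY_AFTER_SECONDS = 3600  # ask crawlers to come back in an hour

class MaintenanceHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if MAINTENANCE:
            self.send_response(503)
            self.send_header("Retry-After", str(RETRY_AFTER_SECONDS))
            self.end_headers()
            self.wfile.write(b"Down for maintenance, back soon.\n")
        else:
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"OK\n")

# To run it locally:
#     HTTPServer(("localhost", 8000), MaintenanceHandler).serve_forever()
```

The same effect is often done in htaccess with a RewriteRule to a 503 page, but the principle is identical: a temporary-error code plus a hint about when to retry.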
|301 - Google will index the target url and its information |
Another aside about Ask.... I have some 301s that Ask hasn't followed for some three years now, and I asked them specifically whether this might in fact be on purpose rather than a glitch, and they said that yes, it might have been intentional. If they see something on a site that suggests to them that you might be buying an old domain and rebranding it, they might not follow the 301. We didn't have time to go into the specifics at the session... I'll be contacting the engineer... nor did I have time to ask the other engines whether some of their apparent 301 glitches might be intentional as well, but the Ask answer does suggest that's a possibility.
No webmaster can guarantee that a URL is gone "forever".
Forever is a very long time.
I would always expect search engines to recheck "gone" URLs from time to time in order to see if they ever came back... a year, a decade, or a century later...
What if there was a URL on your site that didn't ever get indexed, and you could not understand why; and the answer was that the URL had been marked "Gone Forever" several decades ago and three owners back?
What if you bought an expired domain that, while parked, had returned "Gone" for every request, for every one of their tens of millions of previously indexed pages, including the root? "Forever" would mean they would never index your new site.
Good clarifications Robert, and good points g1smd (I like your name initials ;) ).
Indeed, specs are made for machines and error-free environments, not for humans.