|MSN/Yahoo code 200 - Googlebot code 304 |
I've recently added a large database to a site and am hoping the search engines will crawl as many pages as possible. Looking through my logs, I found that while MSN, for example, is doing nicely with lots of "200" status codes, the database pages come up as 304 redirects when Googlebot crawls them. For example:
22.214.171.124 - - [24/Feb/2005] "GET /adirectory/index.php/AnEntry HTTP/1.0" 200 7832 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"
OK so far, but with Googlebot I get:
126.96.36.199 - - [24/Feb/2005] "GET /adirectory/index.php/AnEntry HTTP/1.0" 304 0 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"
I believe the database uses a lot of internal redirecting, but how come MSN and even minor bots like BecomeBot have no problem finding the pages, whereas Google is hitting redirects and going no further? Google eats up the static pages with no problem, but for the database it's always a 304 redirect with 0 bytes read, and Googlebot stops crawling the page, missing the database underneath entirely.
Do I have a problem and what can I do about it?
Thanks for your help,
304 is not a redirect; it is the "Not Modified" status code.
|Not Modified 304 |
If the client has done a conditional GET and access is allowed, but the document has not been modified since the date and time specified in If-Modified-Since field, the server responds with a 304 status code and does not send the document body to the client.
Response headers are as if the client had sent a HEAD request, but limited to only those headers which make sense in this context. This means only headers that are relevant to cache managers and which may have changed independently of the document's Last-Modified date. Examples include Date, Server and Expires.
The purpose of this feature is to allow efficient updates of local cache information (including relevant metainformation) without requiring the overhead of multiple HTTP requests (e.g. a HEAD followed by a GET) and minimizing the transmittal of information already known by the requesting client (usually a caching proxy).
That means msnbot is downloading your content again and again, even if it hasn't changed.
Googlebot instead recognizes unchanged content and saves your traffic quota.
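To make the difference concrete, here is a minimal Python sketch of the decision a server makes on a conditional GET. It is only an illustration, not how any particular web server or your database script implements it; the function name conditional_get_status and its parameters are made up for this example.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def conditional_get_status(if_modified_since, last_modified):
    """Return 304 if the client's copy is still current, otherwise 200.

    if_modified_since: raw If-Modified-Since header value, or None.
    last_modified: timezone-aware datetime of the page's last change.
    """
    if not if_modified_since:
        return 200  # plain GET (no condition): send the full page
    try:
        client_time = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return 200  # unparseable date: ignore the header
    if client_time.tzinfo is None:
        # Treat a date without zone information as UTC.
        client_time = client_time.replace(tzinfo=timezone.utc)
    # HTTP dates have one-second resolution, so drop microseconds.
    if last_modified.replace(microsecond=0) <= client_time:
        return 304  # not modified since the client's copy: no body sent
    return 200
```

A bot that sends no If-Modified-Since header always lands in the first branch and gets a 200 with the full page, while a bot that sends the header gets a 304 for as long as the reported modification date does not move past the date it sent.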
NameNick - this is great information, worth the price of admission for sure. I have wondered about this for so long.
A nice discussion [webmasterworld.com] to review, straight from the horse's mouth.
Thanks for the great info. Googlebot and "If-Modified-Since" sound like a great idea. Unfortunately, in my case Google is not saving time by skipping pages it has already crawled, but rather not crawling new pages at all.
How can I turn off the "IMS header" functionality and get my 200 status codes back?
I would say your best bet is not to turn off IMS, but to use it in your favor. Since If-Modified-Since is sent by the bot, the part you control is the modification date your server reports. Look into functionality that sets that Last-Modified date to "now" or "a few minutes ago", so any bot that sends an If-Modified-Since request will see a very recent modification date. This will hopefully entice them to crawl your pages, and you would get your 200s back.
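As a rough sketch of that idea in Python (your site runs PHP, so treat this as pseudocode to translate), the helper below builds a Last-Modified value in the HTTP date format. The name last_modified_header and the record_updated_at parameter are assumptions for illustration only; with no argument it reports the current time, which is the "now" approach described above.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def last_modified_header(record_updated_at=None):
    """Build a Last-Modified header value in HTTP date format.

    record_updated_at: timezone-aware datetime of the record's last
    change (hypothetical field); if None, the current time is used.
    """
    moment = record_updated_at or datetime.now(timezone.utc)
    # usegmt=True requires a UTC datetime and emits the "GMT" suffix.
    moment = moment.astimezone(timezone.utc).replace(microsecond=0)
    return format_datetime(moment, usegmt=True)
```

Passing the real update time of a database entry, if you have one, would let bots keep getting 304s for entries that truly have not changed, while always reporting "now" makes every request look modified.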
Intelligent reply, it's always a buzz when you finally cut through to the good information and smart ideas. Will do my best to put this into action - thank you :)