I've recently added a large database to a site and am hoping the search engines will crawl as many pages as possible. I've been looking through my logs and found that while MSN, for example, is doing nicely with lots of "200" status codes, when Googlebot crawls the database pages they come up as 304 redirects. For example:
207.46.98.129 - - [24/Feb/2005] "GET /adirectory/index.php/AnEntry HTTP/1.0" 200 7832 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"
OK so far, but with Googlebot I get:
66.249.64.6 - - [24/Feb/2005] "GET /adirectory/index.php/AnEntry HTTP/1.0" 304 0 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"
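For anyone wanting to check this across a whole log rather than by eye, here is a rough Python sketch that tallies status codes per bot. It assumes every line follows the same layout as the two examples above; the names `LOG_LINE` and `status_by_agent` are made up for illustration and are not part of any server tool.

```python
import re
from collections import Counter

# Pattern for the log layout shown in the two example lines above
# (assumption: all lines in the log follow this shape).
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<date>[^\]]*)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

def status_by_agent(lines):
    """Count (bot name, status code) pairs; lines that do not match are skipped."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue
        # Keep only the part of the user agent before the version, e.g. "Googlebot".
        bot = match.group("agent").split("/")[0]
        counts[(bot, int(match.group("status")))] += 1
    return counts

SAMPLE = [
    '207.46.98.129 - - [24/Feb/2005] "GET /adirectory/index.php/AnEntry HTTP/1.0" '
    '200 7832 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"',
    '66.249.64.6 - - [24/Feb/2005] "GET /adirectory/index.php/AnEntry HTTP/1.0" '
    '304 0 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
]

if __name__ == "__main__":
    for (bot, status), n in sorted(status_by_agent(SAMPLE).items()):
        print(bot, status, n)
```

Feeding it the lines of a real access log instead of `SAMPLE` should show at a glance whether a given bot ever receives a 200 for the database pages.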
I believe the database uses a lot of internal redirecting, but how come MSN and even minor bots like BecomeBot have no problem finding the pages, whereas Google is hitting redirects and going no further? Google eats up the static pages with no problem, but for the database it's always a 304 redirect with 0 bytes read, and Googlebot stops crawling the page, missing the database underneath entirely.
Do I have a problem and what can I do about it?
Thanks for your help,
Jeremy
304 is not a redirect; it is the "Not Modified" status code.
From w3.org
Not Modified 304
If the client has done a conditional GET and access is allowed, but the document has not been modified since the date and time specified in the If-Modified-Since field, the server responds with a 304 status code and does not send the document body to the client.
Response headers are as if the client had sent a HEAD request, but limited to only those headers which make sense in this context. This means only headers that are relevant to cache managers and which may have changed independently of the document's Last-Modified date. Examples include Date, Server and Expires.
The purpose of this feature is to allow efficient updates of local cache information (including relevant metainformation) without requiring the overhead of multiple HTTP requests (e.g. a HEAD followed by a GET) and minimizing the transmittal of information already known by the requesting client (usually a caching proxy).
That means that msnbot is downloading your content again and again, even if it hasn't changed.
Googlebot, by contrast, sends a conditional GET, so the server can tell it the content is unchanged, which saves your traffic quota.
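To make the mechanism concrete, here is a minimal Python sketch of the decision a server makes when a conditional GET arrives. It is only an illustration of the rule quoted above, not how any particular web server or your PHP script implements it; the function name `respond` and its parameters are hypothetical.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime

def respond(last_modified, if_modified_since=None, body=b""):
    """Return (status, headers, body) for a GET.

    last_modified: timezone-aware UTC datetime of the document's last change.
    if_modified_since: raw If-Modified-Since header value, or None if absent.
    """
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}
    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # unparseable date: treat as an ordinary GET
        if since is not None:
            if since.tzinfo is None:
                since = since.replace(tzinfo=timezone.utc)
            # HTTP dates have one-second resolution, so drop microseconds.
            if last_modified.replace(microsecond=0) <= since:
                return 304, headers, b""  # Not Modified: no body is sent
    return 200, headers, body

if __name__ == "__main__":
    changed = datetime(2005, 2, 20, 12, 0, 0, tzinfo=timezone.utc)
    # Plain GET, as msnbot appears to send in the log above.
    print(respond(changed, None, b"<html>page</html>")[0])
    # Conditional GET with a date after the last change.
    print(respond(changed, "Thu, 24 Feb 2005 08:00:00 GMT", b"<html>page</html>")[0])
```

The point to notice is that a 304 only makes sense if the client already holds a copy: the server answers 304 because the request carried an If-Modified-Since date, and a request without that header always gets the full 200 response.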
Best regards
NN
Thanks for the great info. Googlebot and "If-Modified-Since" sound like a great idea. Unfortunately, in my case Google is not saving time by skipping pages it has already crawled, but rather not crawling new pages at all.
How can I turn off the "IMS header" functionality and get my 200 status codes back?
Thanks,
Jeremy