Forum Moderators: open
So, here goes all you experts... can you answer these questions:
1) When Googlebot visits, say, an index page, he gets lots of HTTP header info delivered from the server. The info available differs depending on the server, server config, etc., so Googlebot must adapt his crawl decisions to a variety of situations. Let's assume for a moment that the server-delivered "Last-Modified" date is not there......
-- What generic information in the HTTP header may tell Googlebot that this particular index page seems to be fresh and "needs to be fully crawled"?
-- Is the size of the index file available to Googlebot when he comes? (I can't see that with any of my tools)
-- Since there are so many situations to deal with, is it possible that Googlebot ALWAYS grabs the index page header area to use in making his crawl decisions?
More questions to come, but if real answers to the above questions cannot be found, there's no need to go further.
-- What generic information in the HTTP header may tell Googlebot that this particular index page seems to be fresh and "needs to be fully crawled"?
ETag
max-age
There are tons of other methods.
I suggest reading the HTTP RFC.
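To make the ETag / max-age answer concrete, here is a rough sketch (my own, not anything Google has published) of how a crawler could judge freshness from generic response headers alone, with no Last-Modified in sight. The `is_fresh` name and the plain-dict header shape are just for illustration:

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def is_fresh(headers, now=None):
    """Guess whether a cached copy of a page is still fresh using only
    generic response headers: Cache-Control's max-age directive measured
    against the Date header. `headers` is a plain dict of header values."""
    now = now or datetime.now(timezone.utc)
    cache_control = headers.get("Cache-Control", "")
    for directive in cache_control.split(","):
        directive = directive.strip()
        if directive.startswith("max-age="):
            max_age = int(directive.split("=", 1)[1])
            served = parsedate_to_datetime(headers["Date"])
            age = (now - served).total_seconds()
            return age < max_age
    # No freshness information at all: safest to assume stale and recrawl.
    return False
```

A real crawler would also weigh ETag history, observed change frequency, and so on; this only shows the header arithmetic.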
-- Is the size of the index file available to Googlebot when he comes? (I can't see that with any of my tools)
Content-Length
Everybody should support it. It speeds things up enormously (persistent connections) and costs practically nothing to implement.
Using a separate HEAD request simply means that GoogleBot would have to issue two requests to get every single page instead of the one GET it requires at the moment.
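That one-request point is exactly what conditional GETs are for: the crawler sends the validators it saved last time, and the server either answers 304 with no body or 200 with the new page. A minimal sketch (the `conditional_headers` helper and the cached-dict shape are my own invention):

```python
def conditional_headers(cached):
    """Build request headers for a conditional GET from headers we stored
    on the previous crawl. One request does the job of HEAD + GET: the
    server replies 304 (reuse cache) or 200 (here's the new body)."""
    headers = {}
    if "ETag" in cached:
        headers["If-None-Match"] = cached["ETag"]
    if "Last-Modified" in cached:
        headers["If-Modified-Since"] = cached["Last-Modified"]
    return headers
```

If the previous response carried neither validator, the dict comes back empty and the crawler is stuck doing a plain GET.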
However as bhartzer points out if you use SSI / server-side code then you need to put a lot of thought into it to ensure that your server software understands what your code is doing well enough to return 304s when you want it to.
- Tony
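For the server side of Tony's point, here is a toy sketch of the 304 decision a server (or your SSI / server-side code) has to make when If-Modified-Since arrives. The `respond` function is purely illustrative, not any real server's API:

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def respond(if_modified_since, resource_mtime):
    """Pick a status code: given the client's If-Modified-Since header value
    (or None) and the resource's last-modification datetime, return 304 if
    the client's copy is still current, else 200 with a fresh body."""
    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # unparseable date: fall through to a full 200
        if since and resource_mtime <= since:
            return 304
    return 200
```

With dynamic / SSI pages the hard part is not this comparison but knowing `resource_mtime` at all, which is why the server often just sends 200 every time.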
I know that Google themselves recommend "Last-Modified", though I've never seen that in a log file (I would expect a HEAD request).
That may be due to the unsynchronized googlebots, each of which cares only for its very own index.
Googlebot n doesn't care that googlebot n+1 has a more up-to-date version of a file.
It will revisit 50 days after its own copy was last modified.
If your pages change at intervals of less than 50 days, you will most likely never see a 304 on these pages.
I don't know why Google does this, as it clearly works against (violates?) the intention of saving bandwidth.
I just don't understand why they ask us to implement it then?
However, I've mentioned that often enough here; nobody cared.