Forum Moderators: open
So, here goes all you experts... can you answer these questions:
1) When Googlebot visits, say, an index page, he gets lots of HTTP header info delivered from the server. The info available differs depending on the server, server config, etc., so Googlebot must adapt his crawl decisions to a variety of situations. Let's assume for a moment that the server-delivered "Last-Modified" date is not there......
-- What generic information in the HTTP header may tell Googlebot that this particular index page seems to be fresh and "needs to be fully crawled"?
-- Is the size of the index file available to Googlebot when he comes? (I can't see that with any of my tools)
-- Since there are so many situations to deal with, is it possible that Googlebot ALWAYS grabs the index page header area to use in making his crawl decisions?
More questions to come, but if real answers to the above questions cannot be found, there's no need to go further.
-- What generic information in the HTTP header may tell Googlebot that this particular index page seems to be fresh and "needs to be fully crawled"?
ETag
max-age
There are tons of other methods.
I suggest reading the HTTP RFC.
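To make the ETag / max-age answer concrete, here is a rough sketch (my own, not anything Google has published) of how a crawler could judge freshness from generic response headers alone, with no Last-Modified in sight. The `is_fresh` name and the plain-dict header shape are just for illustration:

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def is_fresh(headers, now=None):
    """Guess whether a cached copy of a page is still fresh using only
    generic response headers: Cache-Control's max-age directive measured
    against the Date header. `headers` is a plain dict of header values."""
    now = now or datetime.now(timezone.utc)
    cache_control = headers.get("Cache-Control", "")
    for directive in cache_control.split(","):
        directive = directive.strip()
        if directive.startswith("max-age="):
            max_age = int(directive.split("=", 1)[1])
            served = parsedate_to_datetime(headers["Date"])
            age = (now - served).total_seconds()
            return age < max_age
    # No freshness information at all: safest to assume stale and recrawl.
    return False
```

A real crawler would also weigh ETag history, observed change frequency, and so on; this only shows the header arithmetic.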
-- Is the size of the index file available to Googlebot when he comes? (I can't see that with any of my tools)
Content-Length
Everybody should support it. It speeds things up enormously (persistent connections) and costs practically nothing to implement.
Using a separate HEAD request simply means that GoogleBot would have to issue two requests to get every single page instead of the one GET it requires at the moment.
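That one-request point is exactly what conditional GETs are for: the crawler sends the validators it saved last time, and the server either answers 304 with no body or 200 with the new page. A minimal sketch (the `conditional_headers` helper and the cached-dict shape are my own invention):

```python
def conditional_headers(cached):
    """Build request headers for a conditional GET from headers we stored
    on the previous crawl. One request does the job of HEAD + GET: the
    server replies 304 (reuse cache) or 200 (here's the new body)."""
    headers = {}
    if "ETag" in cached:
        headers["If-None-Match"] = cached["ETag"]
    if "Last-Modified" in cached:
        headers["If-Modified-Since"] = cached["Last-Modified"]
    return headers
```

If the previous response carried neither validator, the dict comes back empty and the crawler is stuck doing a plain GET.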
However as bhartzer points out if you use SSI / server-side code then you need to put a lot of thought into it to ensure that your server software understands what your code is doing well enough to return 304s when you want it to.
- Tony
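For the server side of Tony's point, here is a toy sketch of the 304 decision a server (or your SSI / server-side code) has to make when If-Modified-Since arrives. The `respond` function is purely illustrative, not any real server's API:

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def respond(if_modified_since, resource_mtime):
    """Pick a status code: given the client's If-Modified-Since header value
    (or None) and the resource's last-modification datetime, return 304 if
    the client's copy is still current, else 200 with a fresh body."""
    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # unparseable date: fall through to a full 200
        if since and resource_mtime <= since:
            return 304
    return 200
```

With dynamic / SSI pages the hard part is not this comparison but knowing `resource_mtime` at all, which is why the server often just sends 200 every time.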
I know that Google themselves recommend "Last-Modified", though I've never seen that in a log file (I would expect a HEAD request).
That may be due to the unsynchronized googlebots, each of which cares only for its very own index.
Googlebot n doesn't care that googlebot n+1 has a more up-to-date version of a file.
It will revisit 50 days after its own copy was last modified.
If your pages change at intervals of less than 50 days, you will most likely never see a 304 on these pages.
I don't know why Google does this, as it clearly works against (violates?) the intention of saving bandwidth.
I just don't understand why they ask us to implement it then?
However, I've mentioned that often enough here; nobody cared.