When requesting files via an ordinary browser (Mozilla, IE, ...),
the browser correctly sent the If-Modified-Since header and my CGI replied with "304 Not Modified".
The only one that was "always" being served "200 OK" was Googlebot.
So I logged the traffic, and yesterday [Wed Sep 24 05:05:55 2003] I got the following debug log:
-------------------------------------------------------------------
USER-AGENT: Googlebot/2.1 (+http://www.googlebot.com/bot.html)
IF-MODIFIED-SINCE: Tue, 05 Aug 2003 00:43:16 GMT
Last-Modified: Sun, 21 Sep 2003 23:12:46 GMT
ETag: 2053
Status: 200 OK
-------------------------------------------------------------------
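For anyone wondering why the log above ends in 200 rather than 304, here is a minimal sketch of the comparison a CGI typically makes for a conditional GET. The function name `conditional_status` is made up for illustration and this is not the poster's actual script; it only assumes that the reply is 304 when the page's Last-Modified is not newer than the date the client sent.

```python
from email.utils import parsedate_to_datetime


def conditional_status(if_modified_since, last_modified):
    """Return the status for a conditional GET (hypothetical helper)."""
    # No validator sent: always deliver the full page.
    if not if_modified_since:
        return "200 OK"
    try:
        client_copy = parsedate_to_datetime(if_modified_since)
        current = parsedate_to_datetime(last_modified)
        unchanged = current <= client_copy
    except (TypeError, ValueError):
        # Unparseable or incomparable dates: fall back to a full response.
        return "200 OK"
    return "304 Not Modified" if unchanged else "200 OK"


# Header values taken from the Googlebot log above.
print(conditional_status("Tue, 05 Aug 2003 00:43:16 GMT",
                         "Sun, 21 Sep 2003 23:12:46 GMT"))
```

With the values from the log, the page (Sep 21) is newer than the date Googlebot sent (Aug 5), so under this logic a 200 is the correct answer. The odd part is the stale date in the request, not the server's reply.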
Googlebot is validating against a copy of the page that is 1.5 months old.
I know that it already has the latest version.
It seems like the Googlebots have _very_ different indexes.
Are they just poorly synchronized, or is there a reason for that?
Google writes in the guidelines that we should support the Last-Modified header
to save bandwidth.
They could save even more bandwidth for all of us if they just synchronized their bots better ;)
As you see, Googlebot comes daily but still tries to validate against a months-old version.
Is it a bug?
Today it grabbed the index page twice within 30 minutes:
---------------------------8<---------------------------------
USER-AGENT: Googlebot/2.1 (+http://www.googlebot.com/bot.html)
IF-MODIFIED-SINCE: Wed, 06 Aug 2003 02:36:00 GMT
Last-Modified: Sun, 21 Sep 2003 23:12:46 GMT
ETag: 2053
Status: 200 OK
---------------------------8<---------------------------------
---------------------------8<---------------------------------
USER-AGENT: Googlebot/2.1 (+http://www.googlebot.com/bot.html)
IF-MODIFIED-SINCE: Wed, 06 Aug 2003 02:36:00 GMT
Last-Modified: Sun, 21 Sep 2003 23:12:46 GMT
ETag: 2053
---------------------------8<---------------------------------
Now that is really interesting:
on [24/Sep/2003:05:05:59 +0200] *.18 looked for: IF-MODIFIED-SINCE: Tue, 05 Aug 2003 00:43:16 GMT
on [25/Sep/2003:08:08:35 +0200] *.143 looked for: IF-MODIFIED-SINCE: Wed, 06 Aug 2003 02:36:00 GMT
on [25/Sep/2003:08:35:17 +0200] *.169 looked for: IF-MODIFIED-SINCE: Wed, 06 Aug 2003 02:36:00 GMT
Seems to me like Google refreshes each index when its copy is 50 days old, without the bots synchronizing among themselves.
Isn't this interesting?
Just a thought.
Just a possibility (i.e. guess) that comes to mind in answer to your question. I certainly don't know, of course.
I wouldn't underestimate the Google techies. They are not as lame as such webmasters. They just have to compare a checksum of old and new ... boom. No need to fake the UA.
on [24/Sep/2003:05:05:59 +0200] *.18 looked for: IF-MODIFIED-SINCE: Tue, 05 Aug 2003 00:43:16 GMT
on [25/Sep/2003:08:08:35 +0200] *.143 looked for: IF-MODIFIED-SINCE: Wed, 06 Aug 2003 02:36:00 GMT
on [25/Sep/2003:08:35:17 +0200] *.169 looked for: IF-MODIFIED-SINCE: Wed, 06 Aug 2003 02:36:00 GMT
on [26/Sep/2003:08:30:31 +0200] *.204 looked for: IF-MODIFIED-SINCE: Thu, 07 Aug 2003 05:01:34 GMT
Again: 50 days (plus a few hours) after the date sent in If-Modified-Since.
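The 50-day figure can be checked with a few lines of date arithmetic. This is only a sketch: the four pairs are copied from the list above, and the name `age_of_copy` is made up for illustration.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime

# (visit time from the access log, If-Modified-Since sent by the bot)
visits = [
    ("24/Sep/2003:05:05:59 +0200", "Tue, 05 Aug 2003 00:43:16 GMT"),
    ("25/Sep/2003:08:08:35 +0200", "Wed, 06 Aug 2003 02:36:00 GMT"),
    ("25/Sep/2003:08:35:17 +0200", "Wed, 06 Aug 2003 02:36:00 GMT"),
    ("26/Sep/2003:08:30:31 +0200", "Thu, 07 Aug 2003 05:01:34 GMT"),
]


def age_of_copy(visit, if_modified_since):
    """Time between the bot's remembered date and its visit (hypothetical helper)."""
    # %z parses the +0200 offset, so the subtraction is timezone-aware.
    seen = datetime.strptime(visit, "%d/%b/%Y:%H:%M:%S %z")
    remembered = parsedate_to_datetime(if_modified_since)
    return seen - remembered


ages = [age_of_copy(v, ims) for v, ims in visits]
for (visit, _), age in zip(visits, ages):
    print(visit, "->", age.days, "days,", age.seconds // 3600, "hours")
```

All four visits come out at 50 full days plus one to three hours, which fits the guess of a per-bot refresh interval of roughly 50 days.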
Can this be influenced by REVISIT_AFTER?
What I'm seeing right now is that, due to their decentralized system, some Googlebots are still crawling pages whose directory path I changed a few days ago. I made the change a few days before I was deep-crawled but left the old pages in place. They are no longer linked from anywhere, yet I'm still getting traffic via them (the link from them to the main page is of course still functional).