
last-modified useless

googlebot requests month old files


plasma

4:22 pm on Sep 24, 2003 (gmt 0)

10+ Year Member



Last week I wondered why our CGI's caching algorithm was broken.

When requesting files via an ordinary browser (mozilla, ie, ...),
the browser correctly sent the if-modified-since header and my cgi replied with "304 Not Modified".

The only one that was "always" being served a "200 OK" was Googlebot.
So I logged the traffic and yesterday [Wed Sep 24 05:05:55 2003] got the following debug log:

-------------------------------------------------------------------
USER-AGENT: Googlebot/2.1 (+http://www.googlebot.com/bot.html)
IF-MODIFIED-SINCE: Tue, 05 Aug 2003 00:43:16 GMT
Last-Modified: Sun, 21 Sep 2003 23:12:46 GMT
ETag: 2053
Status: 200 OK
-------------------------------------------------------------------
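For reference, here is a minimal sketch (in Python, with hypothetical names, not my actual CGI) of the conditional-GET logic involved: compare the client's If-Modified-Since date against the resource's Last-Modified and answer 304 only when nothing has changed since the client's copy.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def conditional_status(if_modified_since, last_modified):
    """Decide the status line for a conditional GET.

    if_modified_since: value of the If-Modified-Since request header, or None
    last_modified:     datetime when the resource last changed
    """
    if if_modified_since is None:
        return "200 OK"
    try:
        client_copy = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return "200 OK"  # unparsable date: fall back to a full response
    # 304 only when the resource has not changed since the client's copy
    if last_modified <= client_copy:
        return "304 Not Modified"
    return "200 OK"

# The dates from the debug log above: the page changed on 21 Sep,
# well after Googlebot's 05 Aug copy, so it correctly gets a 200.
ims = "Tue, 05 Aug 2003 00:43:16 GMT"
lm = datetime(2003, 9, 21, 23, 12, 46, tzinfo=timezone.utc)
print(conditional_status(ims, lm))  # 200 OK
```

The point is that the CGI is behaving correctly here; it is the stale If-Modified-Since date that forces the full response.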

Googlebot asks whether the page has changed since a date 1.5 months in the past.
I know that it already has the latest version.

It seems like the googlebots have _very_ different indexes.
Are they just poorly synchronized, or is there a reason for that?

Google writes in its guidelines that we should support the Last-Modified header
to save bandwidth.

They could save even more bandwidth for all of us if they just synchronized their bots better ;)

plasma

1:06 pm on Sep 25, 2003 (gmt 0)

10+ Year Member



64.68.82.18 - - [16/Sep/2003:08:43:07 +0200] "GET / HTTP/1.0" 200 13840 "-" "Googlebot/2.1
64.68.82.168 - - [17/Sep/2003:13:34:23 +0200] "GET / HTTP/1.0" 200 13837 "-" "Googlebot/2.1
64.68.82.169 - - [18/Sep/2003:16:25:51 +0200] "GET / HTTP/1.0" 200 14046 "-" "Googlebot/2.1
64.68.82.28 - - [19/Sep/2003:07:18:24 +0200] "GET / HTTP/1.0" 200 14046 "-" "Googlebot/2.1
64.68.82.168 - - [20/Sep/2003:04:12:43 +0200] "GET / HTTP/1.0" 200 14046 "-" "Googlebot/2.1
64.68.82.7 - - [21/Sep/2003:06:24:34 +0200] "GET / HTTP/1.0" 200 14046 "-" "Googlebot/2.1
64.68.82.170 - - [22/Sep/2003:05:58:46 +0200] "GET / HTTP/1.0" 200 14046 "-" "Googlebot/2.1
64.68.82.169 - - [23/Sep/2003:02:42:22 +0200] "GET / HTTP/1.0" 200 14046 "-" "Googlebot/2.1
64.68.82.18 - - [24/Sep/2003:05:05:59 +0200] "GET / HTTP/1.0" 200 14046 "-" "Googlebot/2.1
64.68.82.143 - - [25/Sep/2003:08:08:35 +0200] "GET / HTTP/1.0" 200 14046 "-" "Googlebot/2.1
64.68.82.169 - - [25/Sep/2003:08:35:17 +0200] "GET / HTTP/1.0" 200 14046 "-" "Googlebot/2.1

As you can see, Googlebot comes daily but still tries to validate against a month-old version.
Is it a bug?

Today it grabbed the index page twice within 30 minutes:

---------------------------8<---------------------------------
USER-AGENT: Googlebot/2.1 (+http://www.googlebot.com/bot.html)
IF-MODIFIED-SINCE: Wed, 06 Aug 2003 02:36:00 GMT
Last-Modified: Sun, 21 Sep 2003 23:12:46 GMT
ETag: 2053
Status: 200 OK
---------------------------8<---------------------------------

---------------------------8<---------------------------------
USER-AGENT: Googlebot/2.1 (+http://www.googlebot.com/bot.html)
IF-MODIFIED-SINCE: Wed, 06 Aug 2003 02:36:00 GMT
Last-Modified: Sun, 21 Sep 2003 23:12:46 GMT
ETag: 2053
---------------------------8<---------------------------------

Now that is really interesting:
on [24/Sep/2003:05:05:59 +0200] *.18 looked for: IF-MODIFIED-SINCE: Tue, 05 Aug 2003 00:43:16 GMT
on [25/Sep/2003:08:08:35 +0200] *.143 looked for: IF-MODIFIED-SINCE: Wed, 06 Aug 2003 02:36:00 GMT
on [25/Sep/2003:08:35:17 +0200] *.169 looked for: IF-MODIFIED-SINCE: Wed, 06 Aug 2003 02:36:00 GMT

Seems to me like Google refreshes each index copy when it is 50 days old, without the bots synchronizing among themselves.

Isn't this interesting?

midwestguy

3:17 pm on Sep 25, 2003 (gmt 0)

10+ Year Member



Maybe Google doesn't feel it can trust that a URL/page hasn't been modified. Otherwise, one could effectively do a bait-and-switch: get a "nice" page indexed, set Last-Modified to deflect Google from re-indexing the page, then replace the "nice" page with a "bad" one. Guess that would make a good "Cloaking made easy with Googlebot" e-book opportunity for someone. ;-)

Just a thought.

plasma

3:34 pm on Sep 25, 2003 (gmt 0)

10+ Year Member



>Guess that would make a good "Cloaking made easy with Googlebot" e-book opportunity for someone. ;-)

Then why does Google ask us to support Last-Modified? ;)

[edited by: plasma at 6:22 pm (utc) on Sep. 25, 2003]

midwestguy

4:13 pm on Sep 25, 2003 (gmt 0)

10+ Year Member



Maybe they use another bot with a spoofed user agent (something other than Googlebot) to sample and test sites, to see which ones they can "trust" not to use Last-Modified for cloaking... *then* enjoy the bandwidth savings (at least until they test for cloaking again) for those sites their sampling and statistical/probability models deem trustworthy, or at least low-risk enough to trust for a while.

Just a possibility (i.e. a guess) that comes to mind in answer to your question. I certainly don't know, of course.

Yidaki

5:06 pm on Sep 25, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>Maybe they use another bot with a spoofed user agent (something other
>than googlebot) to do a sampling or test of sites to see who they can "trust"
>not to use last-modified for cloaking

I wouldn't underestimate the Google techies. They are not as lame as such webmasters. They just have to compare a checksum of the old and new versions ... boom. No need to fake the UA.
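The checksum idea really is that simple. A hypothetical sketch (this is an illustration, not Google's actual method):

```python
import hashlib

def content_digest(body: bytes) -> str:
    # A stable fingerprint of the page body; if it differs from the
    # digest stored at the last crawl, the page changed, whatever
    # Last-Modified claims.
    return hashlib.sha1(body).hexdigest()

# A bait-and-switch is detected regardless of the Last-Modified header:
old = content_digest(b"<html>nice page</html>")
new = content_digest(b"<html>bad page</html>")
print(old == new)  # False
```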

midwestguy

5:53 pm on Sep 25, 2003 (gmt 0)

10+ Year Member



"No need to fake ua."

Wouldn't that depend on if the cloaking site was feeding the googlebot UA or IP the old "just for you, and nice, too" page, with everyone else with a non-googlebot UA/IP getting the "not so nice" page?

plasma

12:45 pm on Sep 26, 2003 (gmt 0)

10+ Year Member




on [24/Sep/2003:05:05:59 +0200] *.18 looked for: IF-MODIFIED-SINCE: Tue, 05 Aug 2003 00:43:16 GMT
on [25/Sep/2003:08:08:35 +0200] *.143 looked for: IF-MODIFIED-SINCE: Wed, 06 Aug 2003 02:36:00 GMT
on [25/Sep/2003:08:35:17 +0200] *.169 looked for: IF-MODIFIED-SINCE: Wed, 06 Aug 2003 02:36:00 GMT

on [26/Sep/2003:08:30:31 +0200] *.204 looked for: IF-MODIFIED-SINCE: Thu, 07 Aug 2003 05:01:34 GMT

Again: the If-Modified-Since date is exactly 50 days before the request date.
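The 50-day pattern is easy to verify with date arithmetic on the logged request and If-Modified-Since dates:

```python
from datetime import datetime

# (request date, If-Modified-Since date) pairs taken from the log excerpts above
pairs = [
    ("24 Sep 2003", "05 Aug 2003"),
    ("25 Sep 2003", "06 Aug 2003"),
    ("26 Sep 2003", "07 Aug 2003"),
]
for req, ims in pairs:
    gap = datetime.strptime(req, "%d %b %Y") - datetime.strptime(ims, "%d %b %Y")
    print(f"{req} - {ims} = {gap.days} days")  # 50 days for every pair
```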

Can this be influenced by REVISIT_AFTER?

dirkz

3:43 pm on Sep 26, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Maybe too many websites fail to deliver a decent Last-Modified, so the bots just download the whole page by default.

What I'm seeing right now is that, due to their decentralized system, some Googlebots are crawling pages whose directory path I changed a few days ago. The pages are still there but no longer linked to. I changed the directory path a few days before I was deep-crawled but left the old pages in place, and I'm still getting traffic via them (the link to the main page is of course still functional).