Used to be I'd see the deepcrawl in the first 10 days of the month with about 50K pages pulled down, then freshy would come around every 4 - 5 days for a few thousand pages.
Since the changes, I see Gbot come every other day for maybe 10K pages and then on the off days, she comes for just a few hundred. Very rarely do I have a full 24 hours when she doesn't even visit me. This pattern was steady for a few months and then this week Gbot stayed on my site for 3 straight days pulling 15-20K pages each day. When she was gone, she had pulled down more than 50K, similar to the old deepcrawl.
Then yesterday, an off-day, she pulled down about 1,000. Today she is back and starting up the heavy stuff again, having already pulled down 10K pages today.
I'm not wondering about IPs of the bots but rather if others have seen a similar pattern or perhaps a different one. I apologize if this is already posted elsewhere in another thread. I didn't see it.
64.68.86.9 - - [12/Sep/2003:11:38:55 -0700] "GET /robots.txt HTTP/1.0" 200 1493 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.86.9 - - [12/Sep/2003:11:38:55 -0700] "GET /Oceanography.html HTTP/1.0" 200 16664 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.87.43 - - [12/Sep/2003:11:39:45 -0700] "GET /robots.txt HTTP/1.0" 200 1493 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.87.43 - - [12/Sep/2003:11:39:45 -0700] "GET /Ecology.html HTTP/1.0" 200 13935 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.86.54 - - [12/Sep/2003:11:40:39 -0700] "GET /robots.txt HTTP/1.0" 200 1493 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.86.54 - - [12/Sep/2003:11:40:39 -0700] "GET /Political_Science.html HTTP/1.0" 200 10552 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.86.54 - - [12/Sep/2003:11:41:24 -0700] "GET /robots.txt HTTP/1.0" 200 1493 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.86.54 - - [12/Sep/2003:11:41:25 -0700] "GET /Libraries.html HTTP/1.0" 200 14559 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.86.54 - - [12/Sep/2003:11:42:04 -0700] "GET /robots.txt HTTP/1.0" 200 1493 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.86.54 - - [12/Sep/2003:11:42:04 -0700] "GET /Criminology_Drug-Awareness.html HTTP/1.0" 200 12065 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.86.54 - - [12/Sep/2003:11:42:42 -0700] "GET /robots.txt HTTP/1.0" 200 1493 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.86.54 - - [12/Sep/2003:11:42:43 -0700] "GET /Encyclopedia.html HTTP/1.0" 200 6666 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.86.79 - - [12/Sep/2003:11:43:18 -0700] "GET /robots.txt HTTP/1.0" 200 1493 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.86.79 - - [12/Sep/2003:11:43:19 -0700] "GET /Search-Engines_General.html HTTP/1.0" 200 13121 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.86.79 - - [12/Sep/2003:11:44:01 -0700] "GET /robots.txt HTTP/1.0" 200 1493 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.86.79 - - [12/Sep/2003:11:44:01 -0700] "GET /Government.html HTTP/1.0" 200 18127 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.87.43 - - [12/Sep/2003:11:44:30 -0700] "GET /robots.txt HTTP/1.0" 200 1493 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.87.43 - - [12/Sep/2003:11:44:31 -0700] "GET /Legal_Search.html HTTP/1.0" 200 12243 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.86.54 - - [12/Sep/2003:11:45:01 -0700] "GET /robots.txt HTTP/1.0" 200 1493 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.86.54 - - [12/Sep/2003:11:45:02 -0700] "GET /Earth-Space.html HTTP/1.0" 200 11947 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.86.9 - - [12/Sep/2003:11:45:38 -0700] "GET /robots.txt HTTP/1.0" 200 1493 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.86.9 - - [12/Sep/2003:11:45:39 -0700] "GET /Botanical_M-Z.html HTTP/1.0" 200 14585 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
Each machine calls for 'robots.txt' before the file.
Yet, just short of an hour earlier:
64.68.87.43 - - [12/Sep/2003:11:09:22 -0700] "GET /History_A-I.html HTTP/1.0" 200 12744 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.87.43 - - [12/Sep/2003:11:09:27 -0700] "GET /Law_Schools.html HTTP/1.0" 200 11416 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.87.43 - - [12/Sep/2003:11:09:29 -0700] "GET /Environmental_A-E.html HTTP/1.0" 200 18969 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.87.43 - - [12/Sep/2003:11:09:30 -0700] "GET /Maps.html HTTP/1.0" 200 7948 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.86.54 - - [12/Sep/2003:11:09:53 -0700] "GET /Human_Resource_Management-Bullying.html HTTP/1.0" 200 4873 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
...and just before that.
64.68.87.43 - - [12/Sep/2003:11:07:55 -0700] "GET /robots.txt HTTP/1.0" 200 1493 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.87.43 - - [12/Sep/2003:11:07:55 -0700] "GET /Radio.html HTTP/1.0" 200 8228 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.86.79 - - [12/Sep/2003:11:07:59 -0700] "GET /robots.txt HTTP/1.0" 200 1493 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.86.79 - - [12/Sep/2003:11:07:59 -0700] "GET /Recycle.html HTTP/1.0" 200 3653 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
...back to tit-for-tat calls.
(Total this week from Google: Near a thousand.)
Seems a bit excessive.
Pendanticist.
Does this have something to do with G's distributed system and storage capacity? I'm no engineer, but it seems like this wouldn't happen in a well-thought-out system.
>>Total this week from Google: Near a thousand.
Yep, and that's just counting the calls for robots.txt. On a deep crawl it's at least that many per day. Sure am glad Google has all that extra bandwidth to play with.
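For anyone who wants to tally this from their own logs rather than eyeball it, here is a minimal sketch in Python. It assumes the Apache combined log format shown in the excerpts above; the function name `count_robots_fetches` and the `agent_marker` parameter are made up for illustration, and the sample lines are copied from the log excerpt earlier in this thread.

```python
import re
from collections import Counter

# Matches the start of an Apache combined-format log line:
# client IP, two ignored fields, [timestamp], then "METHOD PATH ...
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+)')


def count_robots_fetches(lines, agent_marker="Googlebot"):
    """Return (robots.txt fetches per IP, other fetches per IP).

    Only lines containing agent_marker are counted. Matching on the
    user-agent string is a simplification: it does not verify that
    the request really came from Google.
    """
    robots = Counter()
    pages = Counter()
    for line in lines:
        if agent_marker not in line:
            continue
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that are not in the expected format
        ip, _timestamp, _method, path = match.groups()
        if path == "/robots.txt":
            robots[ip] += 1
        else:
            pages[ip] += 1
    return robots, pages


# Four lines taken from the log excerpt posted above.
SAMPLE = [
    '64.68.86.9 - - [12/Sep/2003:11:38:55 -0700] "GET /robots.txt HTTP/1.0" 200 1493 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"',
    '64.68.86.9 - - [12/Sep/2003:11:38:55 -0700] "GET /Oceanography.html HTTP/1.0" 200 16664 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"',
    '64.68.87.43 - - [12/Sep/2003:11:39:45 -0700] "GET /robots.txt HTTP/1.0" 200 1493 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"',
    '64.68.87.43 - - [12/Sep/2003:11:39:45 -0700] "GET /Ecology.html HTTP/1.0" 200 13935 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"',
]

if __name__ == "__main__":
    robots, pages = count_robots_fetches(SAMPLE)
    for ip in sorted(set(robots) | set(pages)):
        print(ip, "robots.txt:", robots[ip], "pages:", pages[ip])
```

To run it against a real log, pass an open file object instead of `SAMPLE`; the function only needs something it can iterate over line by line.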
jimbeetle,
> but I just can't see why the same machine has to make constant calls for it.
There could be multiple machines behind those IP addresses - lots of port addresses available on each IP besides port 80... (If *I* spidered your site from my house, you'd see multiple requests for robots.txt from the same IP address, because all four computers are proxied through the same connection to my ISP.)
Jim
Duh! As always, you remind me of the obvious that I've overlooked or not thought through. Thanks JD.
One of the things I'm curious about here is that the constant fetching of robots.txt makes G appear to be almost paranoid of violating the protocol, i.e., checking robots.txt each time before fetching a file.
On the other hand, G has a somewhat loose interpretation of robots.txt protocol when it comes to indexing pages that are disallowed -- the "We're really not indexing pages that are disallowed, just pointing out that they exist" reasoning. This forces webmasters to "allow" access to the pages and then use meta robots noindex tags on each to prevent listing in G's index (as JD has previously pointed out).
Makes for somewhat cumbersome site management when faced with this "two-headed" bot: on the one hand its scrupulously repetitive checking of robots.txt which, over time, unnecessarily bleeds bandwidth and, on the other hand, its very narrow interpretation of the robots.txt protocol.
(Got distracted and somewhat lost when writing above, hope it's understandable.)
Overnight the request pattern continued with the IP calling for robots.txt also making the page request.
She rested for a little while, then returned to working even more zealously:
Note: (in one second...)
64.68.87.43 - - [12/Sep/2003:18:16:09 -0700] "GET /robots.txt HTTP/1.0" 200 1493 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.87.43 - - [12/Sep/2003:18:16:09 -0700] "GET /robots.txt HTTP/1.0" 200 1493 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.87.43 - - [12/Sep/2003:18:16:09 -0700] "GET /robots.txt HTTP/1.0" 200 1493 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
In one second intervals again...
64.68.87.43 - - [12/Sep/2003:18:59:06 -0700] "GET /robots.txt HTTP/1.0" 200 1493 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.87.43 - - [12/Sep/2003:18:59:06 -0700] "GET /robots.txt HTTP/1.0" 200 1493 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.87.43 - - [12/Sep/2003:18:59:06 -0700] "GET /robots.txt HTTP/1.0" 200 1493 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.87.43 - - [12/Sep/2003:18:59:06 -0700] "GET /robots.txt HTTP/1.0" 200 1493 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.87.43 - - [12/Sep/2003:18:59:06 -0700] "GET /Blahblah.html HTTP/1.0" 200 1879 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.87.43 - - [12/Sep/2003:18:59:06 -0700] "GET /Blahblah.html HTTP/1.0" 200 10628 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.87.43 - - [12/Sep/2003:18:59:06 -0700] "GET /Blahblah.html HTTP/1.0" 200 20031 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.87.43 - - [12/Sep/2003:18:59:06 -0700] "GET /Blahblah.html HTTP/1.0" 200 4185 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
So as to not be left out, this IP Number decides to join the fray...again in one second.
64.68.86.9 - - [12/Sep/2003:20:09:33 -0700] "GET /robots.txt HTTP/1.0" 200 1493 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.86.9 - - [12/Sep/2003:20:09:33 -0700] "GET /robots.txt HTTP/1.0" 200 1493 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.86.9 - - [12/Sep/2003:20:09:33 -0700] "GET /robots.txt HTTP/1.0" 200 1493 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.86.9 - - [12/Sep/2003:20:09:33 -0700] "GET /robots.txt HTTP/1.0" 200 1493 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
I'm not sure how much of a server drain this causes, but you'd think Google'd at least ride herd a bit better.
Pendanticist.
truth_speak,
It is possible that your server is not sending Last-Modified headers. Also, if your server is set up to send very short-term Expires headers, that can cause problems. Another troublemaker is the cache-control header.
If any of these are misconfigured, Google may think you WANT them to check for changes often. And since you have high enough PR to get fresh-botted daily, they will.
Check your server headers here [webmasterworld.com]. Look for the headers named above.
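If you would rather check from a script, the rough Python sketch below looks at the same three headers. The function names (`review_cache_headers`, `fetch_headers`) are hypothetical, and the one-day threshold for calling an Expires value "short-term" is an assumption chosen for illustration, not a figure Google has published.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime
from urllib.request import Request, urlopen


def review_cache_headers(headers, now=None, min_lifetime=timedelta(days=1)):
    """Return a list of notes about Last-Modified, Expires and Cache-Control.

    headers is any mapping of header name to value. min_lifetime is an
    assumed cut-off for a "very short-term" Expires header.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    lowered = {name.lower(): value for name, value in headers.items()}
    notes = []

    if "last-modified" not in lowered:
        notes.append("no Last-Modified header")

    expires_raw = lowered.get("expires")
    if expires_raw is not None:
        try:
            expires = parsedate_to_datetime(expires_raw)
        except (TypeError, ValueError):
            notes.append("Expires header is not a valid date")
        else:
            if expires.tzinfo is None:
                # Treat a date without an offset as UTC.
                expires = expires.replace(tzinfo=timezone.utc)
            if expires - now < min_lifetime:
                notes.append("Expires is less than %s away" % min_lifetime)

    if "no-cache" in lowered.get("cache-control", "").lower():
        notes.append("Cache-Control contains no-cache")

    return notes


def fetch_headers(url):
    """Send a HEAD request to url and return the response headers as a dict."""
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=10) as response:
        return dict(response.headers.items())
```

A typical use would be `review_cache_headers(fetch_headers(url))` with one of your own page URLs; an empty list means none of the three conditions above were spotted, not that the configuration is necessarily ideal.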
Jim
One might then wonder why this anomaly occurs sporadically.
...Last-Modified...:
Sun, 10 Aug 2003 16:32:52 GMT
...send very short-term Expires headers...:
Expires: Sun, 14 Sep 2003 03:16:11 GMT
Another troublemaker is the cache-control header...:
Expires: Sun, 14 Sep 2003 03:16:11 GMT
I just took these readings a few minutes ago, and I see both the 'cache-control' & 'very short-term Expires' headers expiring within 24 hours and Last-Modified being better than a month ago (<-which seems accurate).
Are these copacetic?
Pendanticist.
Those Expires headers look like a problem. I'd set Expires to half the average interval between your page updates or one week, whichever is shorter. On the other hand, I set my own images to expire after 90 days, because I almost never, ever change them -- I might change to a new image, but then I'd use a different filename, so that would be cached separately with its own Expires date.
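As a worked example of that rule of thumb, here is a small Python sketch. It reads "half the average frequency" as half the average time between updates, which is an interpretation; the function name `suggested_expiry` is made up for illustration.

```python
from datetime import timedelta

ONE_WEEK = timedelta(weeks=1)


def suggested_expiry(average_update_interval):
    """Half the average time between page updates, capped at one week."""
    return min(average_update_interval / 2, ONE_WEEK)


if __name__ == "__main__":
    # A page updated about every 4 days and one updated about monthly.
    for interval in (timedelta(days=4), timedelta(days=30)):
        print(interval, "->", suggested_expiry(interval))
```

So a page that changes roughly every four days would get a two-day expiry, while one that changes about once a month would hit the one-week cap.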
The cache-control header (if present) might say something like "public" or "must-revalidate". If it says "no-cache", that could be a problem, because the page won't be cacheable at all. If that's not what you want, it's a problem.
This stuff is controlled by mod_headers and mod_expires in Apache.
Why does it happen sporadically? Because like Windows crashes, it is a software bug (my opinion).
Jim
[ircache.net...]
It'll tell you if you're cool or not :)
I find it interesting that none of the machines (IP Numbers) summarily requesting robots.txt yesterday are crawling today.
They are:
64.68.86.79 - - [12/Sep/2003:21:57:25 -0700] "GET /Blahblah.html HTTP/1.0" 200 5239 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.86.54 - - [12/Sep/2003:21:57:34 -0700] "GET /Blahblah.html HTTP/1.0" 200 15997 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.86.9 - - [12/Sep/2003:21:57:37 -0700] "GET /Blahblah.html HTTP/1.0" 200 17064 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.87.43 - - [12/Sep/2003:21:58:19 -0700] "GET /Blahblah.html HTTP/1.0" 200 9960 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
While I will certainly take steps to improve cacheability, I'm not altogether sold on this being the true/total root cause, given today's flawless activity by other machines.
Perhaps GoogleGuy'll grace us with some additional info as to these respective machines' past behaviours?
I should point out that all machines roving my files today are very well mannered, requesting them at between six (6) and fifteen (15) second intervals.
Pendanticist.