Forum Moderators: open
Yesterday, a Googlebot/2.1 crawler coming from 66.249.66.205 besieged my website. In one day it requested more than 250,000 pages. Averaged out (250,000 / (24 * 3600)), that works out to about 2.89 requests per second, which is not that much. However, this particular crawler performed batched requests: it paused for minutes, then ferociously read dozens of pages again. I hoped it would not exceed a threshold of 10 pages per second, or 20; you would think that would kill any database-driven website. My hopes were wrong:
cat my-really-nice-website.com.20041029 | grep Googlebot | awk -F - '{print $3}' | sort | uniq -c | sort -r | head
47 [29/Oct/2004:21:33:44
44 [29/Oct/2004:08:23:21
41 [29/Oct/2004:21:33:42
41 [29/Oct/2004:08:36:29
41 [29/Oct/2004:08:23:19
40 [29/Oct/2004:08:43:23
40 [29/Oct/2004:08:36:25
40 [29/Oct/2004:08:34:20
39 [29/Oct/2004:10:12:13
39 [29/Oct/2004:09:22:17
As this small measurement shows, there were seconds when the crawler exceeded 40 requests per second; at one point it issued 47 requests in a single second.
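For anyone who wants to reproduce this kind of per-second count, here is a self-contained variant of that one-liner run against a tiny fabricated sample (the log lines and filename are made up). With awk's default whitespace separator, `$4` is the `[dd/Mon/yyyy:hh:mm:ss` timestamp field, and `sort -rn` guarantees the counts are ordered numerically:

```shell
# Fabricated sample lines in Apache combined log format.
cat > sample.log <<'EOF'
66.249.66.205 - - [29/Oct/2004:21:33:44 +0000] "GET /a HTTP/1.0" 200 512 "-" "Googlebot/2.1"
66.249.66.205 - - [29/Oct/2004:21:33:44 +0000] "GET /b HTTP/1.0" 200 512 "-" "Googlebot/2.1"
66.249.66.205 - - [29/Oct/2004:21:33:45 +0000] "GET /c HTTP/1.0" 200 512 "-" "Googlebot/2.1"
EOF
# Hits per second: $4 is the timestamp field; the busiest second sorts first.
grep Googlebot sample.log | awk '{print $4}' | sort | uniq -c | sort -rn | head
```

The top line of the output is the worst second, e.g. `2 [29/Oct/2004:21:33:44` for the sample above.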
Note: this IP definitely belongs to Google. There are no new links pointing to my website, and nothing new has happened there for quite some time; nothing that I know of would justify such a rage.
Today everything is calm again; in fact, Googlebot has completely vanished from the crawlers list for today.
This never happened in the past, and I guess it won't happen again soon. Any similar experiences?
196 [29/Oct/2004:11:13:47
193 [29/Oct/2004:11:13:48
192 [29/Oct/2004:11:13:46
184 [29/Oct/2004:18:48:26
182 [29/Oct/2004:18:48:25
178 [29/Oct/2004:11:13:51
160 [29/Oct/2004:18:48:24
160 [29/Oct/2004:18:48:18
158 [29/Oct/2004:11:13:45
156 [29/Oct/2004:11:13:44
Top rate: 196 pages per second.
Total Pages: 400,000+
Beat that!
:D
My point was not to beat you or anybody in numbers :) My website traffic is really low anyway.
I was flagging an anomaly, and my concern is what comes next. Do you know what follows this demented crawl? Because, as I said, today Googlebot has completely vanished.
Does this happen to your website every day?
I would imagine that Google is either doing a full test run of their new bot, or building a new index, or cleaning house on their existing indices (getting rid of old pages), or (likely) some combination of the three.
I'm also expecting some major changes in the coming weeks.
[google.com...]
I'm only seeing a couple of instances of the Mozilla Googlebot, which reports itself as being v.2.1, just like the "original" non-Moz bot.
I DO note, however, that with the exception of a Dutch bot, I've had NO requests for my robots.txt files since July. Yahoo's bot is freaking out, Googlebot is freaking out ... no rest for the wicked.
Somebody is spoofing Yahoo and Google bot IPs.
The stuff you are reporting is very similar to a situation I reported about the Yahoo bot, late last week ... hundreds of thousands of hits per day in what looks like a bad programming reaction (same pages over and over, etc.)
All of my log entries for the Yahoo-bot-gone-out-of-control are from one IP address, which is definitely one of the Yahoo bot addresses, but just the one, not a series of IPs like the normal bot uses.
Today I woke up and found an 18 MB log file, up from the usual 2-3 MB (and that's with *zero* image, CSS, or JS requests logged), mostly filled with Googlebot requests from 66.249.65.101.
I noticed several points where it was crawling at 30 pages/second, which is far too much, even if I do want Google to crawl my website quickly and completely.
All my URLs are presented as static pages using mod_rewrite, and there's nothing the bot could trip up on. I can't remember the last time Google crawled like this; it's not like them!
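(For context, the setup being described is the usual mod_rewrite pattern: static-looking URLs mapped onto a database-driven script. This is an illustrative rule with made-up path and script names, not my actual config:

```apache
# Illustrative only: serve a static-looking URL from a dynamic script.
RewriteEngine On
RewriteRule ^articles/([0-9]+)\.html$ /show.php?id=$1 [L]
```

From the bot's point of view these look like ordinary .html pages, so there's no URL structure for it to choke on.)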
I noticed that the connections that Mozilla/Googlebot was making specified a KeepAlive option, so requests were basically filed one after the other on the same connection, rather than being piled up in parallel and overwhelming the server.
Seriously, I see no other way to explain why I was getting 200 requests per second, which is practically the *maximum* number of serial requests that could be fulfilled given the latency from Googlebot to my server (30 ms or so). I would imagine that the better your site responds, the more apt Google would be to send you more traffic (i.e. higher ranking).
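That serial ceiling can be sanity-checked with back-of-the-envelope arithmetic: on a single keep-alive connection handling one request at a time, the maximum rate is roughly 1000 / (per-request turnaround in ms). The turnaround figures below are illustrative, not measurements; note that sustaining 200 req/s on one connection implies about 5 ms per request, while a 30 ms turnaround would cap a single connection near 33 req/s:

```shell
# Rough ceiling for serial requests on one keep-alive connection:
# max rate ~ 1000 / turnaround_ms. The turnaround values are illustrative.
awk 'BEGIN {
  n = split("5 10 30", ms, " ")
  for (i = 1; i <= n; i++)
    printf "turnaround %2d ms -> ~%d requests/sec max\n", ms[i], int(1000 / ms[i])
}'
```

So rates around 200/s point to either very fast per-request turnaround or more than one connection in flight.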
[edited by: Critter at 1:08 am (utc) on Nov. 2, 2004]
the Google team did reply, here's their answer:
"We are sorry about any excessive strain Google is causing your web servers. If you would like us to slow down the rate at which Googlebot crawls your site, please send us a copy of your most recent weblog that lists Googlebot, and we will pass your request on to our engineers."
I had very few Googlebot requests for a few days. Then today I had 30k requests again, at this rate:
% cat access_log | grep Googlebot | awk -F - '{print $3}' | sort | uniq -c | sort -r | head
40 [01/Nov/2004:18:17:28
39 [01/Nov/2004:18:16:50
36 [01/Nov/2004:18:17:22
35 [01/Nov/2004:19:17:06
35 [01/Nov/2004:19:15:51
35 [01/Nov/2004:18:17:32
35 [01/Nov/2004:18:17:29
35 [01/Nov/2004:18:17:21
34 [01/Nov/2004:19:17:09
I should email them the access logs. Your growing thread here makes me think my website was by accident one of their first targets; now the crawler is hitting other websites with the same greediness.
What is interesting, though, is that they don't seem to have throttling built into the new crawler. Or maybe this is the next-generation crawler, adapting its crawling speed to the speed of the website's replies.
My site is database-driven, but is uber-optimized so it can handle the load...while my competitors, by necessity, are also db-driven but have pages that take 4 to 5 seconds to load. I'm HOPING that the new bot does rank me higher because of the performance disparity.
If they started differentiating websites based on speed, one side effect would be to rank much lower any website hosted outside the US (because of the latency). Actually, since they launched, they have never expanded their crawler centers, only the frontend datacenters; so crawling-wise, they are still US-centric.