Forum Moderators: open
Yesterday, a Googlebot/2.1 crawler coming from 66.249.66.205 laid siege to my website. In one day it requested more than 250,000 pages. Averaged out (250,000 / (24 * 3600)), that comes to about 2.89 requests per second, which is not that much. However, this particular crawler made its requests in batches: it paused for minutes, then ferociously read dozens of pages again. I hoped it would not exceed a threshold of 10 pages per second, or 20; you would think that would kill any database-driven website. My hopes were wrong:
cat my-really-nice-website.com.20041029 | grep Googlebot | awk -F - '{print $3}' | sort | uniq -c | sort -r | head
47 [29/Oct/2004:21:33:44
44 [29/Oct/2004:08:23:21
41 [29/Oct/2004:21:33:42
41 [29/Oct/2004:08:36:29
41 [29/Oct/2004:08:23:19
40 [29/Oct/2004:08:43:23
40 [29/Oct/2004:08:36:25
40 [29/Oct/2004:08:34:20
39 [29/Oct/2004:10:12:13
39 [29/Oct/2004:09:22:17
As this small measurement shows, there were seconds when the crawler exceeded 40 requests per second; at one point it issued 47 requests in a single second.
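For anyone who wants to reproduce this measurement, here is a self-contained sketch of the same idea. The log file path and its three lines below are fabricated stand-ins for a real access log; in the standard combined log format, the timestamp sits in the fourth whitespace-separated field:

```shell
# Create a tiny fake access log (hypothetical data, combined log format).
cat > /tmp/access.log <<'EOF'
66.249.66.205 - - [29/Oct/2004:21:33:44 +0000] "GET /a HTTP/1.1" 200 123 "-" "Googlebot/2.1"
66.249.66.205 - - [29/Oct/2004:21:33:44 +0000] "GET /b HTTP/1.1" 200 123 "-" "Googlebot/2.1"
66.249.66.205 - - [29/Oct/2004:21:33:45 +0000] "GET /c HTTP/1.1" 200 123 "-" "Googlebot/2.1"
EOF

# Count Googlebot hits per second and list the busiest seconds first:
# $4 is the "[dd/Mon/yyyy:hh:mm:ss" timestamp field.
grep Googlebot /tmp/access.log | awk '{print $4}' | sort | uniq -c | sort -rn | head
```

With the sample data above, the busiest second (21:33:44) shows a count of 2. Note `sort -rn` (numeric sort on the count) is safer than a plain `sort -r` when counts have different digit widths.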
Note: this IP definitely belongs to Google. There are no new links pointing to my website, and nothing new has happened there for quite some time; nothing I know of would justify such a rage.
Today everything is calm again; in fact, Googlebot has completely vanished from today's crawler list.
This never happened in the past, and I guess it won't happen again soon. Any similar experiences?
I think I hit about 20-30 requests a second on some rather stressful areas of our site, and it has seriously affected overall site performance...
So far 160k pages today!
Certainly shows up the "rather dubious" ASP code of the site...(memory munch in full effect!)
Been frantically trying to streamline code all weekend!
It's large-scale war out here....
Yesterday I got an email from my hosting company warning me my mod_rewrite in the .htaccess file was using 30% of the CPU.
On further investigation the rule causing the problem dealt with a session ID.
All I can conclude is that Googlebot has been indexing the same page with different session IDs and causing havoc.
The reason I think it is the gbot is that the prime user of bandwidth is in the 66.249.65.#*$! range.
The site is a Postnuke site using mod_rewrite for friendly URLs. It's a very common fix so I'm wondering if anyone else is experiencing similar problems?
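If the session-ID theory is right, one common fix is to stop handing URL-embedded session IDs to crawlers in the first place. The fragment below is a sketch only, under the assumption that PHP's `session.use_trans_sid` is appending a `PHPSESSID` parameter to links; it is not taken from the poster's actual configuration:

```apache
# Sketch, not the poster's rules. Assumes PHP appends PHPSESSID=... to URLs.

# Option 1: never embed session IDs in URLs; rely on cookies instead.
php_flag session.use_trans_sid off
php_flag session.use_only_cookies on

# Option 2: redirect crawler hits that carry a session ID to a clean URL,
# so the bot sees one canonical address per page. This drops the whole
# query string, which is fine when PHPSESSID is the only parameter.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} Googlebot [NC]
RewriteCond %{QUERY_STRING} PHPSESSID= [NC]
RewriteRule ^(.*)$ /$1? [R=301,L]
```

Either way, the bot stops seeing the same page under thousands of distinct URLs, which should also take the load off the mod_rewrite rules.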
Think about this a little: I wouldn't mind having only fast websites as Google results. Old rule: any web page should load in under 8 seconds, and 8 seconds is already an eternity. New rule: how about 1 second? How about a tenth of a second? Think about how much time you would save over your life if every website you visit obeyed this rule.