Beware of Asking Google to Slow Their Mozilla Compatible Robot Down?

Forum Moderators: Robert Charlton & goodroi


AlexK

1:56 pm on Jul 6, 2005 (gmt 0)

On 28 June G told me "We've reduced the load on your servers".

The M_Bot [webmasterworld.com] (identified by

Mozilla/5.0 (compatible; Googlebot/2.1;

in the user-agent string) had been hitting my site at up to 3 times/sec (avg: 836/day), triggering unruly-bot-prevention routines [webmasterworld.com] in the PHP scripts. I had used the G on-site form [google.com] a few days before, and asked them (nicely) to stop it. So far so good.
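For readers unfamiliar with such routines, here is a minimal sketch of a sliding-window request limiter of the kind that a burst of 3 hits/sec could trip. It is written in Python rather than PHP and is not the routine linked above; the class name, the limit of 2 hits per second and the client key are all assumptions made for illustration.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `max_hits` requests per client within `window` seconds."""

    def __init__(self, max_hits, window):
        self.max_hits = max_hits
        self.window = window
        # One queue of recent request timestamps per client (e.g. per IP).
        self._hits = defaultdict(deque)

    def allow(self, client, now):
        """Return True if the request is accepted, False if it should be blocked."""
        hits = self._hits[client]
        # Forget timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.max_hits:
            return False
        hits.append(now)
        return True


# Hypothetical limit: 2 requests per second per client.
limiter = SlidingWindowLimiter(max_hits=2, window=1.0)
decisions = [limiter.allow("66.249.65.X", t) for t in (0.0, 0.3, 0.6, 1.1)]
```

With these made-up timestamps the third request arrives while two earlier ones are still inside the window, so it is the one that gets refused; a real routine would also need to decide how long a block lasts.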

By July 1 it was clear that--rather than turning the knob down a notch or two--they had switched it off altogether, at least as far as the M_Bot was concerned - there were no hits from this bot at all. The G_Bot (identified just by

Googlebot/2.1 (+http://www.google.com/bot.html)

in the user-agent string) had also slowed:

So, my site has now dropped from 28,000 G-hits in June to a likely 500 in July. So, beware...

mrMister

2:14 pm on Jul 6, 2005 (gmt 0)

Yes, Google have a history of giving <people> exactly what they ask for.

Take a look at the first post (msg#521) here for an example...

[webmasterworld.com...]

Personally, I think it's great that Google have a sense of humour.

As the old adage goes...

"Be careful what you wish for... it may come true!"

[edited by: lawman at 8:45 pm (utc) on July 6, 2005]

AlexK

8:30 pm on Jul 6, 2005 (gmt 0)

mrMister:

Personally, I think it's great that Google have a sense of humour.

Do you know Mel Brooks' 2,000-year-old man [en.wikipedia.org]-derived definition of humour?

Humour:- a sabre-toothed tiger enters the cave, drags out your neighbour, and eats him.
Tragedy:- You stub your toe on a rock.

It's always funny when it happens to others.

<snip> I think that I may be due my shot of humour. I am willing to wait.

[edited by: lawman at 8:46 pm (utc) on July 6, 2005]

lawman

8:44 pm on Jul 6, 2005 (gmt 0)

If we can all abide by TOS #4 (be respectful of other members), then I'm sure I won't have to edit anyone.

AlexK

11:20 pm on Jul 6, 2005 (gmt 0)

Update:
Google have told me "We can't guarantee that your site will be crawled at any particular frequency".

Whilst at first sight this does seem to make a nonsense of G's "Googlebot is overloading my servers" [google.com] page, on further consideration it suggests that there really is only an on/off switch, rather than a graduated knob.

The more I think about this, the more important it seems to know just what is the case.

Chico_Loco

11:36 pm on Jul 6, 2005 (gmt 0)

Doesn't Google abide by the "crawl-delay" parameter, which can be put in robots.txt?

That would probably have been the better solution.
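For reference, Crawl-delay is a non-standard robots.txt extension, and nothing in this thread confirms that Googlebot honours it. The sketch below only shows what such a rule looks like and how a crawler could read it with Python's standard urllib.robotparser; the 10- and 2-second values are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: a slower rate for Googlebot, a default for the rest.
ROBOTS_TXT = """\
User-agent: Googlebot
Crawl-delay: 10

User-agent: *
Crawl-delay: 2
"""


def crawl_delay_for(robots_txt, agent):
    """Return the Crawl-delay (in seconds) that applies to `agent`, or None."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(agent)


googlebot_delay = crawl_delay_for(ROBOTS_TXT, "Googlebot")
default_delay = crawl_delay_for(ROBOTS_TXT, "SomeOtherBot")
```

Whether a given crawler acts on the value is entirely up to that crawler.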

sit2510

7:55 am on Jul 7, 2005 (gmt 0)

IMO, we can trust Googlebot to behave well when crawling the World Wide Web. On rare occasions it may run wild, but I would not bother asking Google to slow the bots down, as it is more prudent to let it correct itself naturally. In all cases, the overloading is only temporary and short-term, and things then return to normal. No hassle.

BReflection

5:26 am on Jul 8, 2005 (gmt 0)

Google has commented in the past that they would crawl harder if it weren't for the smaller webmasters complaining. That said, they probably don't like to hear you complaining ;)

victor

6:58 am on Jul 8, 2005 (gmt 0)

Google has commented in the past that they would crawl harder if it weren't for the smaller webmasters complaining.

Google could solve that problem in 5 minutes by looking for a crawl-delay: 0 for googlebot in robots.txt.

That would be treated as permission for crawling at high speed.

If they did that, and let it be known that they were doing that, webmasters could decide to add that or not.

It's not a matter of webmasters' complaining. It's to do with secondary web activities (like index builders) honoring the wishes of the primary movers (those providing content).

AlexK

1:20 pm on Jul 9, 2005 (gmt 0)

Here is a compendium of past intelligence on the M_Bot, compiled whilst searching WebmasterWorld for info on the crawl-delay parameter. It is all there if you look for it...

I get blasts of activity up to 20 requests a second!

a leopard never changes its spots

I got 561143 hits from googlebot ... What we saw ... was a denial of service attack

It is definitely following java

Mozilla bot grabbing js files

Everything that gets spidered is not getting indexed

mozilla googlebot has to "approve" of pages before the normal bot will index

it takes compressed pages when offered

msndude: We do support what we call a crawl delay.

Crawl-Delay:

"We're sorry, this robots.txt does NOT validate."

...and this is the first sighting of this bot [webmasterworld.com].

AlexK

6:01 pm on Aug 8, 2005 (gmt 0)

Now that July is finished, here is an update to the stats. First though, these are the timings of the emails:

...and here are comparative stats for the last 3 months:

July: ............................Pages
Inktomi Slurp ...................24,065
Google AdSense ..................15,972
MSNBot ..........................11,990
Googlebot HTTP/1.0 .................866
Googlebot HTTP/1.1 Mozilla/5.0 ......61

June:
Googlebot HTTP/1.1 Mozilla/5.0 ..25,089
MSNBot ..........................19,211
Google AdSense ..................15,424
Inktomi Slurp ...................14,236
Googlebot HTTP/1.0 ...............1,801

May:
MSNBot ..........................20,475
Google AdSense ..................14,193
Inktomi Slurp ...................11,654
Googlebot (HTTP/1.0 + HTTP/1.1) ..4,409

As the man said, be careful what you ask for, you might just get it.
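To put a number on the drop, the Googlebot rows of the table above can be added up per month. This is just arithmetic on the posted figures; the variable and function names are my own, and the AdSense bot is left out because the table lists it separately.

```python
# Googlebot page counts per month, copied from the table above
# (HTTP/1.0 and HTTP/1.1 Mozilla/5.0 rows; May is already a combined figure).
GOOGLEBOT_PAGES = {
    "May": [4409],
    "June": [25089, 1801],
    "July": [866, 61],
}

totals = {month: sum(pages) for month, pages in GOOGLEBOT_PAGES.items()}


def percent_drop(before, after):
    """Percentage fall from `before` to `after`."""
    return (before - after) / before * 100


june_to_july = percent_drop(totals["June"], totals["July"])
print(totals)
print(f"June to July drop: {june_to_july:.1f}%")
```

On these figures Googlebot fetched 26,890 pages in June and 927 in July, a fall of roughly 97%.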

walkman

4:20 pm on Aug 9, 2005 (gmt 0)

Googlebot is "killing" me too, but I don't mind it. For example, on a 1200 page site, I've had 4000+ visits today. ALL my outbound links go through a redirect, so they are part of the 4000.

I see GB getting the same file 10-15 times a day sometimes (since sitemaps came into existence, even though I have the frequency set to "daily"), but it's much better for me. Now my pages get indexed within 2-3 days and I'm getting targeted traffic right away.

Thanks for the warning, though.

koan

7:39 pm on Aug 9, 2005 (gmt 0)

I have this problem also. I just received a notification from my anti-abuse script that the Mozilla-compatible bot from Google had triggered it and was automatically blocked from reading further. Now I just hope Google won't penalize my site for it. Damn, this script is there for unruly site leechers, not real web indexers. I never expected Google to behave like that.

jomaxx

11:28 pm on Aug 9, 2005 (gmt 0)

AlexK, did you mention how many pages your site contains to be spidered? Hundreds? Thousands?

my3cents

1:53 pm on Aug 10, 2005 (gmt 0)

I am seeing very strange things from the 66.249.65.X bots

Every request has a double slash at the start of the path, though I cannot find any links with double slashes. Many of these are deep internal pages that are unlikely to have any external inbound links. I've checked for internal links with double slashes and there are none.

example: [domain.com...]

"GET //pagename.html HTTP/1.1" 200 21531 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

I am getting hundreds of these a day. If I go to the URL, it's a 404, but I have a spider tracking script on these pages, and it is showing the pages as being spidered. The only way the spider tracking program could show it is if the page loads, but the page will not load with double slashes.

Anyone seeing this in their logs, or have a clue what may be happening?
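For anyone wanting to check their own logs for the same pattern, here is a rough sketch that picks out Googlebot requests whose path starts with a double slash. It assumes log lines end in the request / status / size / referrer / user-agent layout shown in the example above; the function name and regular expression are my own and would need adjusting for other log formats.

```python
import re

# Matches the tail of a log line in the layout quoted above:
# "METHOD path HTTP/x.y" status size "referrer" "user-agent"
REQUEST_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) HTTP/[\d.]+" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def double_slash_googlebot_hits(lines):
    """Return (path, status) for Googlebot requests whose path begins with '//'."""
    hits = []
    for line in lines:
        match = REQUEST_RE.search(line)
        if not match:
            continue
        if match.group("path").startswith("//") and "Googlebot" in match.group("agent"):
            hits.append((match.group("path"), int(match.group("status"))))
    return hits


sample = [
    '"GET //pagename.html HTTP/1.1" 200 21531 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '"GET /pagename.html HTTP/1.1" 200 21531 "-" "SomeBrowser/1.0"',
]
hits = double_slash_googlebot_hits(sample)
```

Comparing the status codes in the results with what a browser gets for the same URL may help narrow down why the server answers 200 to the bot.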

Andem

4:07 pm on Aug 10, 2005 (gmt 0)

>>>I am seeing very strange things from the 66.249.65.X bots

My site was (?) banned from Google recently in the July 28 changes, and I'm still working to get back into the index. I am seeing 66.249.65.X crawling this site, several thousand requests per day... and then stopping. The next day, they're at it again.

I don't know exactly what it's for, but I am used to seeing "Googlebot/2.1 (+http://www.googlebot.com/bot.html)" and not "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

What's going on?
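Since the two bots are being told apart purely by user-agent string in this thread, here is a small sketch of that distinction. The labels "M_Bot" and "G_Bot" follow AlexK's shorthand; the function name is my own, and a user-agent string can be faked, so this says nothing about whether a request really came from Google.

```python
def classify_googlebot(user_agent):
    """Label a user-agent string using the thread's shorthand.

    Returns "M_Bot" for the Mozilla-compatible form, "G_Bot" for the plain
    form, "unknown" for any other string mentioning Googlebot, else None.
    """
    if "Googlebot" not in user_agent:
        return None
    if user_agent.startswith("Mozilla/5.0 (compatible; Googlebot/"):
        return "M_Bot"
    if user_agent.startswith("Googlebot/"):
        return "G_Bot"
    return "unknown"


m_bot = classify_googlebot(
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
)
g_bot = classify_googlebot("Googlebot/2.1 (+http://www.googlebot.com/bot.html)")
```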

AlexK

8:04 pm on Aug 10, 2005 (gmt 0)

jomaxx:

did you mention how many pages your site contains to be spidered? Hundreds? Thousands?

> ten thousand. 8,237 different pages were viewed in July (by humans, not robots).

walkman:

...sitemaps...

That's the secret weapon that I'm keeping in reserve if nothing changes.

koan:

...notification by my anti-abuse script that the mozilla compatible bot from Google triggered it and was automatically blocked from reading further.

The precise incident on my site that caused this whole thread. Seems that it hasn't changed its ways one jot.

walkman

10:48 pm on Aug 10, 2005 (gmt 0)

>> That's the secret weapon that I'm keeping in reserve if nothing changes

No need to hold it in reserve at all. Go right ahead and use it.

AlexK

11:20 pm on Aug 10, 2005 (gmt 0)

My concern is with a 10,000+ page site - the sitemap will be huge. Plus the time to code it into a dynamic site. Not impossible, of course, but that time can be used to do more important things.

It's on my list of "things to do soon".

jomaxx

5:22 am on Aug 11, 2005 (gmt 0)

Creating the site map can be very simple. A plain text file, one URL per line, is all you need. Then submit it to Google. You can also do it in phases, so everything doesn't get spidered at once.
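As a rough illustration of the plain-text approach, the sketch below writes one URL per line and splits the list into several files so they can be submitted in phases. The file names, the phase size and the example URLs are placeholders; on a dynamic site the URL list would come from the database instead.

```python
from pathlib import Path


def split_into_phases(urls, per_file):
    """Split the URL list into consecutive chunks of at most `per_file` URLs."""
    return [urls[i:i + per_file] for i in range(0, len(urls), per_file)]


def write_text_sitemaps(urls, out_dir, per_file):
    """Write each chunk to sitemap-N.txt, one URL per line; return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for number, chunk in enumerate(split_into_phases(urls, per_file), start=1):
        path = out / f"sitemap-{number}.txt"
        path.write_text("\n".join(chunk) + "\n", encoding="utf-8")
        paths.append(path)
    return paths


# Placeholder URLs standing in for a 10,000+ page site.
example_urls = [f"http://www.example.com/page{n}.html" for n in range(1, 6)]
phases = split_into_phases(example_urls, per_file=2)
```

Submitting one file at a time, as suggested above, keeps everything from being spidered at once.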

Anyway I know it's too late to say this now, but a 10,000-page site with a lot of inbound links will get spidered like crazy. I have a site around the same size, and Googlebot takes an average of 3,500-4,000 pages a day.

AlexK

2:27 am on Aug 12, 2005 (gmt 0)

jomaxx:

...10,000-page site...Googlebot takes an average of 3,500-4,000 pages a day

(Boggle) So, you reckon that I took the correct decision, then?

jomaxx

6:14 pm on Aug 12, 2005 (gmt 0)

That wasn't even counting the AdSense "Mediapartners" bot. I haven't asked them to slow down, but I do think they crawl more frequently and more intensely than is necessary.

You should be fine, as long as you make sure every page gets crawled at least once during an indexing cycle. I don't know if they really do that "deep crawl" before the update any more, but if a page doesn't get crawled at least once during the indexing cycle then my Google Belief System says that it will probably get ranked lower or not at all.

AlexK

7:19 pm on Aug 12, 2005 (gmt 0)

(At the risk of moving this thread away from the original topic)

jomaxx, are you using a sitemap? If so, what is the frequency, and how many new pages are added to the file? In msg #12, walkman spoke of a 1200 page site getting 4000+ visits in one day (which is as outrageous as your own case).

I would rather be in my current situation re: G than yours. (Boggle, again).

PS I did ask Google to return the Status Quo re: the M_Bot on 6 July, but with neither reply nor effect. At the time I was annoyed. Now, I am beginning to bless my lucky stars.

KiShOrE

12:55 am on Aug 13, 2005 (gmt 0)

I don't know what's wrong here.

Google visits one of the sites I maintain, but it doesn't seem to index any page that it visits.

The last time Google indexed that website was back in April. What can be the reason? Any idea?

Here is last 3 days web stats...

+++++++++++++++++++++++
RobotStats - Google Bot (http://www.google.com/)

User-Agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Quantity: 6732
+++++++++++++++++++++++

So, what do you say?

walkman

1:08 am on Aug 13, 2005 (gmt 0)

today I got 400 visits (so far) on the same 1200 page site.

flex55

11:51 am on Aug 15, 2005 (gmt 0)

Andem, KiShOrE:

I have also been crawled hard by the G Mozilla bot over the last week (5K-10K pages a day), but nothing got into the index.

I have read on WW that the Mozilla bot doesn't get pages into the index but what I really need to know is:

Did anyone see the regular G bot come AFTER the Mozilla bot? Or did anyone see pages get into the index after a deep crawl by the Mozilla bot?

Andem, KiShOrE - I'd be happy to hear if you have any updates.

AlexK

12:44 am on Aug 16, 2005 (gmt 0)

flex55:

Did anyone see the regular G bot come AFTER the Mozilla bot?

Yes.

My own experience + research suggests that the G_Bot needs to hit a page three times before the page gets into the index (see msg#5+7 [webmasterworld.com]). The same research suggests that the M_Bot does not count towards this total, but does suggest that the M_Bot 'scouts' a page first, then the G_Bot follows up.

Just to curdle the blood, I also saw the reverse (the M_Bot hit a page after the G_Bot, and took the page out of the index).

incrediBILL

2:13 am on Aug 16, 2005 (gmt 0)

I had a similar issue but worse with Google/Yahoo/MSN all hitting my site at the same time and the AdSense mediabot joining in for fun. Heck, I even upgraded to a dual Xeon server just because of their nonsense.

No way was I going to ask them to slow it down via support, as I feared that what happened to the OP would happen to me and I'd fall from grace with the spiders. I stuck the Crawl-delay in robots.txt and it's been much more civilized, but it seems the dang spiders are always on my site now, taking a page or two, as they just can't get it all fast enough anymore.

flex55

2:18 pm on Aug 16, 2005 (gmt 0)

Thanks for the thread ref AlexK.

I did some more reading and found someone mentioning 2 weeks until pages get into the index (I think it was Dayo_UK, can't find it now) after the Mozilla deep crawl - then I guess I'll wait (urrrrrrrr).

AlexK

5:34 am on Aug 20, 2005 (gmt 0)

Still on the subject of heavy/fast crawls by the M_Bot, there have been a couple of threads recently on the same subject:

Heavy GoogleBot Attack? [webmasterworld.com]: 8 Aug on: 17,000 pages on a 100 visitors/day site; 3 sites, all getting hit.
(msg8): 18 Aug: 37,364 hits in Aug so far
nonstop crawling [webmasterworld.com]: 16 Aug: 27,000 hits on a 1,500/month site.
(msg8): 18 Aug: 2 sites, each thousands of requests daily from the mozilla bot (500 pages indexed).

In the past, this type of activity has been followed 2/3 weeks later by a G-update. Get ready.
