Yes, Google have a history of giving <people> exactly what they ask for.
Take a look at the first post (msg#521) here for an example...
Personally, I think it's great that Google have a sense of humour.
As the old adage goes...
"Be careful what you wish for... it may come true!"
[edited by: lawman at 8:45 pm (utc) on July 6, 2005]
|Personally, I think it's great that Google have a sense of humour. |
Do you know Mel Brooks' 2,000-year-old man [en.wikipedia.org]-derived definition of humour?
Humour: a sabre-toothed tiger enters the cave, drags out your neighbour, and eats him.
Tragedy: you stub your toe on a rock.
It's always funny when it happens to others.
<snip> I think that I may be due my shot of humour. I am willing to wait.
[edited by: lawman at 8:46 pm (utc) on July 6, 2005]
If we can all abide by TOS #4 (be respectful of other members), then I'm sure I won't have to edit anyone.
Google have told me "We can't guarantee that your site will be crawled at any particular frequency".
Whilst at first sight this does seem to make a nonsense of G's "Googlebot is overloading my servers" [google.com] page, on further consideration it suggests that there really is only an on/off switch, rather than a graduated knob.
The more I think about this, the more important it seems to know just what is the case.
Doesn't Google abide by the "crawl-delay" parameter, which can be put in robots.txt?
That would probably be the better solution.
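For reference, here is a minimal robots.txt sketch using that parameter. Note that at the time of this thread, Crawl-delay was documented by Yahoo! Slurp and msnbot, while Google had not announced support for it; the delay values below are just examples.

```text
# Crawl-delay is the number of seconds a compliant bot should
# wait between fetches. Slurp and msnbot honour it; Googlebot
# has not documented support for it.
User-agent: Slurp
Crawl-delay: 5

User-agent: msnbot
Crawl-delay: 5

User-agent: *
Disallow:
```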
IMO, we can trust Googlebot to behave well when crawling the World Wide Web. On rare occasions it may run wild, but I would not bother asking Google to slow the bots down, as it is more prudent to let it fix itself naturally. In all cases, overloading is only temporary, and things soon return to normal. No hassle.
Google has commented in the past that they would crawl harder if it weren't for the smaller webmasters complaining. That said, they probably don't like to hear you complaining ;)
|Google has commented in the past that they would crawl harder if it weren't for the smaller webmasters complaining. |
Google could solve that problem in 5 minutes by looking for a crawl-delay: 0 for googlebot in robots.txt.
That would be treated as permission for crawling at high speed.
If they did that, and let it be known that they were doing that, webmasters could decide to add that or not.
It's not a matter of webmasters complaining. It's to do with secondary web activities (like index builders) honoring the wishes of the primary movers (those providing content).
Here is a compendium of past intelligence on the M_Bot, compiled whilst searching WebmasterWorld for info on the crawl-delay parameter. It is all there if you look for it...
...and this is the first sighting of this bot [webmasterworld.com].
Now that July is finished, here is an update to the stats. First though, these are the timings of the emails:
June 26: (me to Google) "Please stop it."
June 28: (Google to me) "We've reduced the load on your servers"
July 3: (me to Google) "Your bots now do not crawl my site at all."
July 6: (Google to me) "we see no cause for concern at this time"
...and here are comparative stats for the last 3 months:
July: ............................Pages
Inktomi Slurp ...................24,065
Google AdSense ..................15,972
Googlebot HTTP/1.0 .................866
Googlebot HTTP/1.1 Mozilla/5.0 ......61
June: ............................Pages
Googlebot HTTP/1.1 Mozilla/5.0 ..25,089
Google AdSense ..................15,424
Inktomi Slurp ...................14,236
Googlebot HTTP/1.0 ...............1,801
May: .............................Pages
Google AdSense ..................14,193
Inktomi Slurp ...................11,654
Googlebot (HTTP/1.0 + HTTP/1.1) ..4,409
As the man said, be careful what you ask for, you might just get it.
Googlebot is "killing" me too, but I don't mind it. For example, on a 1200-page site, I've had 4000+ visits today. All my outbound links go through a redirect, so they are part of the 4000.
I see GB getting the same file 10-15 times a day sometimes (since Sitemaps came into existence, even though I have the frequency set to "daily"), but it's much better for me. Now my pages get indexed within 2-3 days and I'm getting targeted traffic right away.
thanks for the warning though
I have this problem also. I just received a notification from my anti-abuse script that the Mozilla-compatible bot from Google triggered it and was automatically blocked from reading further. Now I just hope it won't penalize my site for it. Damn, this script is there for unruly site leechers, not real web indexers. I never expected Google to behave like that.
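One way an anti-abuse script can avoid blocking the real crawler is the verification method Google itself recommends: a reverse-DNS lookup whose result must fall under googlebot.com or google.com, followed by a forward lookup that must map back to the original IP. A sketch, assuming nothing about the poster's actual script (the function names are mine):

```python
import re
import socket

GOOGLE_RDNS = re.compile(r'\.(googlebot|google)\.com$')

def rdns_looks_like_google(hostname):
    """Pure check: does a reverse-DNS name fall under
    googlebot.com or google.com?"""
    return bool(GOOGLE_RDNS.search(hostname.rstrip('.')))

def is_real_googlebot(ip):
    """Full check: reverse lookup, domain test, then a forward
    lookup that must return the original IP. Anything claiming
    to be Googlebot that fails this is fair game to block."""
    try:
        hostname = socket.gethostbyaddr(ip)[0]
    except socket.herror:
        return False
    if not rdns_looks_like_google(hostname):
        return False
    try:
        return ip in socket.gethostbyname_ex(hostname)[2]
    except socket.gaierror:
        return False
```

The double lookup matters: the reverse-DNS name alone can be spoofed by whoever controls the IP's PTR record, but they cannot make the forward lookup of a real googlebot.com name resolve back to their own address.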
AlexK, did you mention how many pages your site contains to be spidered? Hundreds? Thousands?
I am seeing very strange things from the 66.249.65.X bots
Every request has double slashes, though I cannot find any links with double slashes. Many of these are deep internal pages that are unlikely to have any external inbound links. I've checked for internal links with double slashes and there are none.
"GET //pagename.html HTTP/1.1" 200 21531 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
I am getting hundreds of these a day. If I go to the URL, it's a 404, but I have a spider-tracking script on these pages, and it is showing the pages as being spidered. The only way the spider-tracking program could show it is if the page loads, but the page will not load with double slashes.
anyone seeing this in their logs or have a clue what may be happening?
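To quantify how widespread the doubled slashes are, one can scan the access log for requests whose path starts with "//". A small sketch against the combined-format line quoted above (the regex and function name are mine, not anything from the thread):

```python
import re

# Pull the method and path out of a combined-format access-log line.
REQUEST_RE = re.compile(r'"([A-Z]+) (\S+) HTTP/[\d.]+"')

def doubled_slash_requests(log_lines):
    """Return the request paths that begin with '//'."""
    hits = []
    for line in log_lines:
        m = REQUEST_RE.search(line)
        if m and m.group(2).startswith('//'):
            hits.append(m.group(2))
    return hits

sample = ('"GET //pagename.html HTTP/1.1" 200 21531 "-" '
          '"Mozilla/5.0 (compatible; Googlebot/2.1; '
          '+http://www.google.com/bot.html)"')
print(doubled_slash_requests([sample]))   # ['//pagename.html']
```

Grouping the hits by user-agent as well would show whether only the Mozilla-flavoured Googlebot is sending them.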
>>>I am seeing very strange things from the 66.249.65.X bots
My site was (?) banned from Google recently in the July 28 changes, and I'm still working to get back into the index. I am seeing 66.249.65.#*$! crawling this site, several thousand requests per day... and then stopping. The next day, they're at it again.
I don't know exactly what it's for, but I am used to seeing "Googlebot/2.1 (+http://www.googlebot.com/bot.html)" and not "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
What's going on?
|did you mention how many pages your site contains to be spidered? Hundreds? Thousands? |
More than ten thousand. 8,237 different pages were viewed in July (by humans, not robots).
That's the secret weapon that I'm keeping in reserve if nothing changes.
|...notification by my anti-abuse script that the mozilla compatible bot from Google triggered it and was automatically blocked from reading further. |
The precise incident on my site that caused this whole thread. Seems that it hasn't changed its ways one jot.
>> That's the secret weapon that I'm keeping in reserve if nothing changes
No need to hold it in reserve at all. Go right ahead and use it.
My concern is with a 10,000+ page site - the sitemap will be huge. Plus the time to code it into a dynamic site. Not impossible, of course, but that time can be used to do more important things.
It's on my list of "things to do soon".
Creating the site map can be very simple. A plain text file, one URL per line, is all you need. Then submit it to Google. You can also do it in phases, so everything doesn't get spidered at once.
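The "plain text file, one URL per line" format really is that simple, which makes the 10,000-page worry mostly a matter of dumping the page list the dynamic site already has. A sketch, with a made-up base URL and page list:

```python
def write_text_sitemap(base_url, paths, out_path):
    """Write a Google Sitemaps plain-text file: one absolute
    URL per line, nothing else."""
    base = base_url.rstrip('/')
    with open(out_path, 'w') as f:
        for p in paths:
            f.write('%s/%s\n' % (base, p.lstrip('/')))

# Hypothetical usage: on a real dynamic site the page list
# would come straight out of the CMS or database.
write_text_sitemap('http://www.example.com',
                   ['/index.html', '/articles/1.html'],
                   'sitemap.txt')
```

Phased submission, as suggested above, is then just a matter of slicing the list before writing each file.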
Anyway I know it's too late to say this now, but a 10,000-page site with a lot of inbound links will get spidered like crazy. I have a site around the same size, and Googlebot takes an average of 3,500-4,000 pages a day.
|...10,000-page site...Googlebot takes an average of 3,500-4,000 pages a day |
(Boggle) So, you reckon that I took the correct decision, then?
That wasn't even counting the AdSense "Mediapartners" bot. I haven't asked them to slow down, but I do think they crawl more frequently and more intensely than is necessary.
You should be fine, as long as you make sure every page gets crawled at least once during an indexing cycle. I don't know if they really do that "deep crawl" before the update any more, but if a page doesn't get crawled at least once during the indexing cycle then my Google Belief System says that it will probably get ranked lower or not at all.
(At the risk of moving this thread away from the original topic) jomaxx, are you using a sitemap? If so, what is the frequency and how many new pages added to the file? In msg #:12 walkman spoke of a 1200 page site getting 4000+ visits in one day (which is as outrageous as your own case).
I would rather be in my current situation re: G than yours. (Boggle, again).
PS I did ask Google to return the Status Quo re: the M_Bot on 6 July, but with neither reply nor effect. At the time I was annoyed. Now, I am beginning to bless my lucky stars.
I dunno what's wrong with this:
Google visits one of the sites I maintain, but it seems it doesn't index any page that it visits.
The last time Google indexed that website was back in April. What could be the reason? Any idea?
Here is last 3 days web stats...
RobotStats - Google Bot (http://www.google.com/)
User-Agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
So, what do you say?
Today I got 400 visits (so far) on the same 1200-page site.
I also have been crawled hard by the G Mozilla bot the last week (5K-10K pages a day) - but nothing got into the index.
I have read on WW that the Mozilla bot doesn't get pages into the index but what I really need to know is:
Did anyone see the regular G bot come AFTER the Mozilla bot? or did anyone see pages get into the index after a deep crawl of the Mozilla bot?
Andem, KiShOrE - I'd be happy to hear if you have any updates.
|Did anyone see the regular G bot come AFTER the Mozilla bot? |
My own experience + research suggests that the G_Bot needs to hit a page three times before the page gets into the index (see msg#5+7 [webmasterworld.com]). The same research suggests that the M_Bot does not count towards this total, but does suggest that the M_Bot 'scouts' a page first, then the G_Bot follows up.
Just to curdle the blood, I also saw the reverse: the M_Bot hit a page after the G_Bot, and took the page out of the index.
I had a similar issue but worse with Google/Yahoo/MSN all hitting my site at the same time and the AdSense mediabot joining in for fun. Heck, I even upgraded to a dual Xeon server just because of their nonsense.
No way was I going to ask them to slow it down via support as I feared what happened to the OP and I'd fall from grace with the spiders. I stuck the Crawl-delay in robots.txt and it's been much more civilized but it seems the dang spiders are always on my site now taking a page or two as they just can't get it all fast enough anymore.
Thanks for the thread ref AlexK.
I did some more reading and found someone mentioning 2 weeks until pages get into the index (I think it was Dayo_UK, can't find it now) after the Mozilla deep crawl - then I guess I'll wait (urrrrrrrr).
Still on the subject of heavy/fast crawls by the M_Bot, there have been a couple of threads recently on the same subject:
- Heavy GoogleBot Attack? [webmasterworld.com]: 8 Aug on: 17,000 pages on a 100 visitors/day site; 3 sites, all getting hit.
(msg8): 18 Aug: 37,364 hits in Aug so far
- nonstop crawling [webmasterworld.com]: 16 Aug: 27,000 hits on a 1,500/month site.
(msg8): 18 Aug: 2 sites, each thousands of requests daily from the mozilla bot (500 pages indexed).
In the past, this type of activity has been followed 2-3 weeks later by a G-update. Get ready.