I don't like having my site hit so hard by a bot. I rely on google for a substantial portion of my traffic, and so I'm not willing to take any steps that might jeopardize that. I've also heard stories about people who asked google to limit their scans, and the result was that the googlebot never returned.
Does anyone have a story to tell about how they faced my situation and took some action? What was the outcome?
but, even at its fastest, it was no more than one page every 20-30 seconds...
I loved it, it was great, and I can't wait for my next one! I'd say, let Google ravage your site as much as it wants!
One page a second is not really very fast, as robots go. If your server can't keep up with that you've got a problem somewhere. Some spiders, especially the less respectable ones, can hit your site 10X as hard as that.
I agree. It averaged one page per second, with instances of 3 pages in a single second. I have seen my site deliver 10 pages per second. Still, the rate google was crawling my site could have delayed response times for other visitors and caused them to leave. That's my main concern.
I wonder why google needs to generate bursts of high traffic rather than spreading out their requests over time. Other legitimate robots never take so many pages in such a short time span. They also don't penalize people who ask for less frequent scans. Google doesn't even offer a robots.txt option (like crawl-delay) to slow down their scan.
And then there are the stories of sites not being returned to by the googlebot after the webmaster makes a request by email to have the scan slowed. Anyone out there have firsthand or secondhand info on that?
Andrew Hitchcock, I asked the crawl team about this a while ago, and there’s a good reason. It turns out that a lot of webmasters give crawl-delay values that are way out of whack, in the sense that we’d only be able to crawl 15-20 urls from a site in an entire day. I’ll try to post more details about that sometime in the future. The crawl guys are interested in allowing people to give some sort of hostload hint, but it’s their opinion that crawl-delay isn’t the best way to do it.
Andrew was asking why Google did not support crawl-delay. As of February 8th, they did not.
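For readers unfamiliar with the directive being discussed: Crawl-delay is a non-standard robots.txt extension that Yahoo's Slurp recognized at the time, with the value read as the number of seconds the crawler should wait between requests; Googlebot did not honor it. A minimal sketch of what it looks like (the 10-second figure is only an illustration):

User-agent: Slurp
Crawl-delay: 10

User-agent: *
Disallow: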
Personally, I wouldn't bite the hand that feeds you.
At this point, given the small amount of info that I have, I have no plans to take any action. They are the 800-pound gorilla and I am at their mercy.
I think it's unfortunate that google lacks the courtesy to restrict their crawls to a reasonable level.
I guess the bots' masters no longer fear being blocked, so they're doing whatever they please. From where I'm standing, it seems like google, yahoo, and anyone else who sends some traffic to websites (or, more importantly, might become a good source of traffic in the future) feels free to run wild.
As for limiting requests to every 120 seconds, that's beyond ridiculous. It would be enough time to hand-write individual responses to page requests. A nice Martha-like touch, but few webmasters do it any more.
One reason is that search engine spiders can be a highly distributed user-agent -- that is, all the "parts" of the spidering program do not even need to reside on the same physical machine. The decision about the next url to request is not necessarily made by a simple process that is analogous to a click. Although for many crawling sequences it can look like "following a click trail", there are a lot of logic routines going on to make that decision.
It would be nice if things were as simple as looking at a referer -- but as I said, googlebot sends no referer. In fact, I think it would be impossible even to DEFINE the referer for any particular "get". And even if it were possible, on the scale of a complete web crawl the added CPU cycles and bandwidth would be extensive.
BTW the referrer ID is not available, but the browser identifier is plainly visible and that's probably what was intended.
I wonder why google needs to generate bursts of high traffic
stories of sites not being returned to by the googlebot after the webmaster makes a request by email to have the scan slowed. Anyone out there have firsthand or secondhand info on that?
The Mozilla Google-bot was burning my site at up to 30,000 hits/month [webmasterworld.com], sometimes 3/sec. G was asked last June to slow it down. They effectively switched it off (just 50/month in the Autumn). The "normal" G-bot also appeared to slow down (about 1,000/month), although the Adsense-bot compensated by going bananas (from 15,000 to 25,000/month) (you just cannot win, can you?).
I have not noticed any effect on visitors from G. Having said that, the site did get crucified in the Sept Update (and later recovered), but my site is hardly unique in that.
BillyS:
From Matt Cutts' blog: I asked the crawl team about this a while ago, and there’s a good reason. It turns out that a lot of webmasters give crawl-delay values that are way out of whack, in the sense that we’d only be able to crawl 15-20 urls from a site in an entire day.
surfin2u:
Yahoo's Slurp bot supports a crawl delay, but recently it began to ignore it.
I'm not sure if this is good or not. There seem to be some weird search anomalies where my site is coming up high or first in searches where you'd never expect to find it on the first page.
Could it be that Google likes sites that let it crawl all over them?
Has anyone noticed a drop in search hits after asking Google to slow down?
It's eating up a lot of traffic!
Would Google kill you if you used, say, PHP to give a stripped-down version of pages according to USER_AGENT, as you might for Lynx viewers?
These are taken from my logs:
"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
"Googlebot/2.1 (+http://www.google.com/bot.html)"
"Googlebot-Image/1.0"
"Mediapartners-Google/2.1"
"googlebot-urlconsole"
First three are 'normal' bots, fourth one checks page content for AdSense. IIRC fifth is from the remove-url bot.
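On the earlier question about using PHP to serve a stripped-down version of pages by USER_AGENT: a minimal sketch, assuming the Googlebot identifiers listed above. The file names lite.php and full.php are made up for illustration, and whether Google would penalize this kind of user-agent switching is exactly the open question in this thread.

<?php
// Detect Google's crawlers by User-Agent and serve a lighter template.
// lite.php / full.php are hypothetical page names.
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

// Matches "Googlebot/2.1", "Mozilla/5.0 (compatible; Googlebot/2.1; ...)",
// "Googlebot-Image/1.0" and "Mediapartners-Google/2.1" from the list above.
$is_google_bot = (stripos($ua, 'Googlebot') !== false)
              || (stripos($ua, 'Mediapartners-Google') !== false);

if ($is_google_bot) {
    include 'lite.php';   // stripped-down markup, as one might do for Lynx
} else {
    include 'full.php';   // normal page with images, scripts, etc.
}
?>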
I have reason to believe referrer strings are always sent. I have a test site that is virtually never visited by normal users; it is spidered only. Over several months googlebot visited the site a few hundred times, yet there is not a single line in the logs without a referrer string.
As far as I can tell, Yahoo applies the delay on a per-IP basis, and (if my site is typical) employs a multiplicity of IPs simultaneously. The combo of the above renders Crawl-delay academic.
They're not even bothering doing that on my site now. They took about 3000 pages yesterday, using a delay of about 20 seconds most of the time. Yes, they did come from various IP addresses, but even requests from each individual IP address ignored my crawl-delay.
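A rough way to check whether a crawler is really honoring a delay on a per-IP basis is to measure the gap between successive requests from each crawler IP in the raw logs. A sketch in PHP, assuming an Apache combined-format access log at a hypothetical path:

<?php
// Smallest gap (in seconds) between successive requests from each crawler IP,
// to check whether a Crawl-delay is actually being honored.
// /var/log/apache/access.log is a hypothetical path; adjust to your setup.
$months = array('Jan'=>1,'Feb'=>2,'Mar'=>3,'Apr'=>4,'May'=>5,'Jun'=>6,
                'Jul'=>7,'Aug'=>8,'Sep'=>9,'Oct'=>10,'Nov'=>11,'Dec'=>12);
$last   = array();  // IP => timestamp of previous request
$mingap = array();  // IP => smallest gap seen so far

foreach (file('/var/log/apache/access.log') as $line) {
    if (stripos($line, 'Slurp') === false) continue;   // Yahoo's crawler
    if (!preg_match('#^(\S+) \S+ \S+ \[(\d+)/(\w+)/(\d+):(\d+):(\d+):(\d+)#', $line, $m))
        continue;
    $ip = $m[1];
    $t  = mktime($m[5], $m[6], $m[7], $months[$m[3]], $m[2], $m[4]);
    if (isset($last[$ip])) {
        $gap = $t - $last[$ip];
        if (!isset($mingap[$ip]) || $gap < $mingap[$ip]) $mingap[$ip] = $gap;
    }
    $last[$ip] = $t;
}
foreach ($mingap as $ip => $gap) {
    echo "$ip  shortest interval: {$gap}s\n";
}
?>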
Regarding google, thanks for the info on the dangers of asking them to slow down. That's what I needed to know. I have no intention of "biting the hand that feeds me".
I'm even willing to put up with Yahoo, despite the fact that their crawler hits me 100 times more frequently than visitors that they refer to me.
(Yahoo) even requests from each individual IP address ignored my crawl-delay
(google) thanks for the info on the dangers of asking them to slow down ... I have no intention of "biting the hand that feeds me"
(Yahoo) despite the fact that their crawler hits me 100 times more frequently than visitors that they refer to me
the site did get crucified in the Sept Update
I need to work more on the comprehension aspects of my statements.
Sorry, I did misunderstand. I thought you were making a connection between getting crucified and asking google to slow down.
Regarding Yahoo, there are a couple of other threads going about how fed up many webmasters are with them. Welcome to the club!
Sorry, I did misunderstand.
Regarding Yahoo, ... how fed up many webmasters are with them