Sitemaps, Meta Data, and robots.txt Forum

Googlebot not respecting custom crawl rate
rowan194
1:17 am on Jun 14, 2011 (gmt 0)

[Apologies if this is not the correct forum - it's about a bot behaving badly, but it may be more appropriate in a Google-specific area. I couldn't find an appropriate one.]

I set my custom crawl rate to 0.1 fetches/sec (a 10-second delay between fetches) a couple of months ago, but Googlebot is still fetching at a much faster rate.

I pulled some random samples from recent minutes:

21:02 - 27 fetches
23:29 - 33 fetches
06:48 - 45 fetches
10:52 - 47 fetches
11:02 - 39 fetches

And to be sure it wasn't a longer-term thing, I counted the fetches over the last complete day (a 24-hour period) - 57,792 fetches, an average of 40.1 per minute.

10 seconds between fetches should mean a maximum of 6 per minute... so why isn't my setting being respected? I've just changed it to 15 seconds between fetches to see if that gives something a kick.

I've set the custom crawl rate because my site has millions of pages and Google's search referrals are not proportional to the rate at which it's indexing those pages. The infamous cap? (For several months I was consistently receiving around 30,000 referrals per day, even though Google's count of indexed pages had grown 20-30 times over that period.)

Side note, does anyone else think it's extremely arrogant for Google to reset the requested custom rate after 90 days? Remind me then to review my decision by all means, but don't just quietly drop it and go back to the default, forcing me to log in and change the setting to what I actually want... >:(


lucy24
3:10 am on Jun 14, 2011 (gmt 0)

:: detour to GWT to check up on nasty suspicion ::

I think there's a loophole. I'm ### if I can find any mention of anything but pages. So in a ten-second period a clever robot can pick up one HTML page, twenty images, three stylesheets, a couple of JavaScript files you'd forgotten all about, a MIDI or two...

My personal record is, I think, 100 fetches in 16 seconds. Based on the occasional behavior of bad robots, that may be my server's physical limit.

Are those horrendous numbers all fetches of pages?

rowan194
3:51 am on Jun 14, 2011 (gmt 0)

My OP was based on Google's IP range, but I've realised this may also include auxiliary bots like AdSense, Translate, Web Accelerator, etc. I've checked an alternate log file, and based on the user-agent there were 55,460 fetches from Googlebot itself over a 24-hour period.

These are all indeed text/html fetches.

rowan194
4:10 am on Jun 14, 2011 (gmt 0)

Before the custom crawl rate setting was introduced (and I was able to ask Googlebot through GWT to calm down), it wasn't unusual for G to fetch more than 120,000 pages a day. Google ignores the Crawl-delay robots.txt directive.
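
For reference, the ignored directive is just a line in robots.txt; a minimal example (Crawl-delay is a non-standard extension - Yahoo and Bing honoured it at the time, but Google never has):

# ask compliant crawlers to wait 10 seconds between fetches
User-agent: *
Crawl-delay: 10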

It wasn't so much of an issue back then, but now that my database complexity and size have grown, each page takes more I/O to render. The majority of my server load is dedicated to servicing Googlebot's requests!

rowan194
6:07 am on Jun 16, 2011 (gmt 0)

55460 13-Jun-2011
50295 14-Jun-2011
57469 15-Jun-2011

Crawl rate doesn't seem to have changed at all, even though I backed it off when I started this thread. Googlebot is now fetching at approximately 10 times the maximum rate I've asked for.

I've ended up temporarily firewalling it in an attempt to get it to back off... yes, I've BLOCKED Googlebot!
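
(A gentler alternative to a firewall rule, for anyone else in this spot, is answering Googlebot with 503s - the conventional "back off temporarily" signal. A minimal mod_rewrite sketch, untested, assuming mod_rewrite is loaded:)

# Answer Googlebot with 503 Service Unavailable instead of dropping packets
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} Googlebot [NC]
RewriteRule .* - [R=503,L]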

Does anyone know how I might signal Google that there seems to be an issue?

rowan194
12:38 am on Jun 18, 2011 (gmt 0)

I ended up setting up a pseudo-proxy so that requests from Googlebot were transparently forwarded to another server with a copy of the same database. This took the excessive load off the main server.
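
(For anyone wanting to replicate this: one way to do that kind of forwarding is mod_rewrite's proxy flag. A sketch only - the backend address here is made up, it goes in the vhost config, and mod_proxy must be loaded:)

# Transparently hand Googlebot's requests to the mirror server
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} Googlebot [NC]
RewriteRule ^/(.*)$ http://192.0.2.10/$1 [P,L]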

I noticed that Googlebot currently seems to have backed off a bit, but I think it would be premature to assume anything... the actual fetches per hour vary wildly, so it could still catch up (in one hour it did 5,822 fetches, or 1.6 per second!)

Does ANYONE know how I can get onto a human at Google?

g1smd
1:27 am on Jun 18, 2011 (gmt 0)

If you are setting this through robots.txt, have you used the WMT feature to fetch the robots.txt file just in case there is something in there that it can't read?

rowan194
2:01 am on Jun 18, 2011 (gmt 0)

It's GWT where I'm setting the custom crawl rate. Google specifically ignores the crawl-delay robots.txt directive - GWT even tells you they've ignored that line. :)

I think I may have found the problem - GWT seems to treat domain.com and www.domain.com as separate entities. Only one is listed as confirmed in my case, and Googlebot is predominantly fetching from the *other*.

I don't recall it being like this before; and if they made that change, why didn't they offer some way to tie the two entities together? (The verification filename is exactly the same for both...)

I've changed the crawl rate on the other one.

So now I have to mark my calendar to update two "domains" every 90 days! >:(

edit: now that domain.com has been verified, it does seem to have tied it to www.domain.com. I've set the preferred domain to domain.com, as that's the one with the most indexed pages in G. It would be nice if they (a) forwarded system-generated notifications to your email address (rather than requiring you to log into GWT to see them) and (b) proactively invited you to review your settings when a new feature is added that may affect your sites.

[edited by: rowan194 at 2:28 am (utc) on Jun 18, 2011]

g1smd
2:23 am on Jun 18, 2011 (gmt 0)

Yes, www and non-www are separate sites as far as WMT is concerned. The incoming external links and crawl rate reports are interesting for each.

You should not be directly serving content on both of them; you need a domain canonicalisation redirect from non-www to www.
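
A minimal sketch of that redirect in the root .htaccess (using this thread's example.com placeholder):

# 301 every non-www request to the www hostname
RewriteEngine On
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]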

rowan194
2:34 am on Jun 18, 2011 (gmt 0)

As mentioned above, I've changed the preferred domain in GWT. Right now it's showing 171k indexed pages for www.domain.com and 222k for domain.com, so they're almost equal. I wonder how many of those are duplicates?

I'm continuing to log the hostname in my Apache logs to see which version gets the most HUMAN hits, and will then set up a 301 redirect for the loser in the near future.
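
(One way to capture the hostname is %{Host}i in the LogFormat; a sketch, with the log path made up:)

# Prepend the Host request header to the usual combined log fields
LogFormat "%{Host}i %h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" vhost_combined
CustomLog /var/log/apache2/access.log vhost_combined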

g1smd
2:42 am on Jun 18, 2011 (gmt 0)

Use these two Google searches:
[site:www.example.com] and [site:example.com -inurl:www]
to see what's going on.

In particular, is the root homepage listed as the first entry?

If not, that's another indication of problems within the site.

rowan194
3:07 am on Jun 18, 2011 (gmt 0)

Thanks for the tip. I had to try a slightly modified query because my site pages often contain 'www' in the URL. I used "site:example.com -inurl:www.example.com"

In this case the counts were 172k with www, 65k without. So I guess the 222k count above was actually counting both, and I should set the preferred domain in GWT to www.example.com...

There's no sign of the root page on the first page of results for either query.

rowan194
3:38 am on Jun 22, 2011 (gmt 0)

Here's what I ended up doing.

1. Set preferred domain to www.example.com in GWT

2. Configured web server to 301 redirect all requests for example.com to www.example.com

3. Disabled custom crawl rate in GWT for example.com. (I don't mind if GBot sends 100k requests per day that 301, because this doesn't involve any database I/O, and the redirected URLs get queued rather than fetched immediately)

4. Maintained custom crawl rate of 0.1/sec for www.example.com. Will review this once things have settled.
