| 3:10 am on Jun 14, 2011 (gmt 0)|
:: detour to GWT to check up on nasty suspicion ::
My personal record is, I think, 100 fetches in 16 seconds. Based on the occasional behavior of bad robots, that may be my server's physical limit.
Are those horrendous numbers all fetches of pages?
| 3:51 am on Jun 14, 2011 (gmt 0)|
My OP was based on Google's IP range, but I've realised this may also include auxiliary bots like AdSense, Translate, Web Accelerator, etc. I've checked an alternate log file, and based on the user-agent there were 55,460 fetches from Googlebot itself over a 24-hour period.
These are all indeed text/html fetches.
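In case anyone wants to reproduce this kind of count, here's a rough sketch using standard shell tools. The sample log lines and the `access.log` file name are placeholders; point it at your real combined-format Apache log.

```shell
# A rough sketch of counting Googlebot fetches from a combined-format
# Apache log. The sample lines and the file name "access.log" are
# placeholders for your real log.
LOG=access.log
printf '%s\n' \
  '66.249.66.1 - - [14/Jun/2011:03:10:01 +0000] "GET /page1 HTTP/1.1" 200 512 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"' \
  '66.249.66.1 - - [14/Jun/2011:03:10:02 +0000] "GET /page2 HTTP/1.1" 200 512 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"' \
  > "$LOG"

# Total fetches whose user-agent mentions Googlebot:
grep -c 'Googlebot' "$LOG"        # -> 2 for the sample above

# Fetches bucketed per hour, to see how bursty the crawl is:
grep 'Googlebot' "$LOG" | awk -F'[:[]' '{print $2 ":" $3}' | sort | uniq -c
```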
| 4:10 am on Jun 14, 2011 (gmt 0)|
Before the custom crawl rate feature was introduced (and I was able to ask Googlebot through GWT to calm down), it wasn't unusual for G to fetch more than 120,000 pages a day. They ignore the Crawl-delay robots.txt directive.
It wasn't so much of an issue back then, but now that my database complexity and size has grown, each page takes more I/O to render. The majority of my server load is dedicated to servicing Googlebot's requests!
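For reference, the directive in question goes in robots.txt like this (the 10-second value is just an example; as noted, Googlebot disregards the line, though some other crawlers honor it):

```
User-agent: *
Crawl-delay: 10
```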
| 6:07 am on Jun 16, 2011 (gmt 0)|
Crawl rate doesn't seem to have changed at all, even though I backed it off when I started this thread. Googlebot is now fetching at approximately 10 times the maximum rate I've asked for.
I've ended up temporarily firewalling it in an attempt to get it to back off... yes, I've BLOCKED Googlebot!
Does anyone know how I might signal Google that there seems to be an issue?
| 12:38 am on Jun 18, 2011 (gmt 0)|
I ended up setting up a pseudo-proxy so that requests from Googlebot were transparently forwarded to another server with a copy of the same database. This took the excessive load off the main server.
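The forwarding can be sketched in Apache with mod_rewrite's proxy flag (a rough sketch, assuming mod_proxy is loaded; `mirror.example.com` is a placeholder for the second server, and note that matching on the user-agent alone will also forward anyone spoofing Googlebot):

```
# Transparently forward requests claiming to be Googlebot to a mirror server
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} Googlebot [NC]
RewriteRule ^/?(.*)$ http://mirror.example.com/$1 [P,L]
```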
Googlebot currently seems to have backed off a bit, but I think it would be premature to assume anything... the fetches per hour vary wildly, so it could still catch up (in one hour it did 5,822 fetches, or 1.6 per second!)
Does ANYONE know how I can get through to a human at Google?
| 1:27 am on Jun 18, 2011 (gmt 0)|
If you are setting this through robots.txt, have you used the WMT feature to fetch the robots.txt file just in case there is something in there that it can't read?
| 2:01 am on Jun 18, 2011 (gmt 0)|
It's GWT where I'm setting the custom crawl rate. Google specifically ignores the crawl-delay robots.txt directive - GWT even tells you they've ignored that line. :)
I think I may have found the problem - GWT seems to treat domain.com and www.domain.com as separate entities. Only one is listed as confirmed in my case, and googlebot is predominantly fetching from the *other*.
I don't recall it being like this before; if they made that change, why didn't they offer some way to tie the two entities together? (The verification filename is exactly the same for both...)
I've changed the crawl rate on the other one.
So now I have to mark my calendar to update two "domains" every 90 days! >:(
edit: now that domain.com has been verified, it does seem to have been tied to www.domain.com. I've set the preferred domain to domain.com, as that's the one with the most indexed pages in G. It would be nice if they (a) forwarded system-generated notifications to your email address (rather than requiring you to log into GWT to see them) and (b) proactively invited you to review your settings when a new feature is added that may affect your sites.
[edited by: rowan194 at 2:28 am (utc) on Jun 18, 2011]
| 2:23 am on Jun 18, 2011 (gmt 0)|
Yes, www and non-www are separate sites as far as WMT is concerned. The incoming external links and crawl rate reports are interesting for each.
You should not be directly serving content on both of them; you need a domain canonicalisation redirect from non-www to www.
| 2:34 am on Jun 18, 2011 (gmt 0)|
As mentioned above, I've changed the preferred domain in GWT. Right now it's showing 171k indexed pages for www.domain.com and 222k for domain.com, so they're in the same ballpark. Wonder how many of those are dupes?
I'm continuing to log the hostname in my Apache logs to see which version gets the most HUMAN hits; then I'll set up a 301 redirect for the loser in the near future.
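In case it helps anyone, logging the hostname can be done by prepending the Host header to Apache's standard combined format (a sketch; the format nickname `vcombined` and the log path are placeholders):

```
LogFormat "%{Host}i %h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" vcombined
CustomLog /var/log/apache2/access.log vcombined
```

With the hostname as the first field, something like `awk '{print $1}' access.log | sort | uniq -c` gives a quick per-hostname hit count.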
| 2:42 am on Jun 18, 2011 (gmt 0)|
Use these two Google searches:
[site:www.example.com] and [site:example.com -inurl:www]
to see what's going on.
In particular, is the root homepage listed as the first entry?
If not, that's another indication of problems within the site.
| 3:07 am on Jun 18, 2011 (gmt 0)|
Thanks for the tip. I had to try a slightly modified query because my site pages often contain 'www' in the URL. I used "site:example.com -inurl:www.example.com"
In this case the counts were 172k with www, 65k without. So I guess the 222k count above was actually counting both, and I should set the preferred domain in GWT to www.example.com...
There's no sign of the root page on the first page of results for either query.
| 3:38 am on Jun 22, 2011 (gmt 0)|
Here's what I ended up doing.
1. Set preferred domain to www.example.com in GWT
2. Configured web server to 301 redirect all requests for example.com to www.example.com
3. Disabled custom crawl rate in GWT for example.com. (I don't mind if GBot sends 100k requests per day that 301, because this doesn't involve any database I/O, and the redirected URLs get queued rather than fetched immediately)
4. Maintained custom crawl rate of 0.1/sec for www.example.com. Will review this once things have settled.
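For anyone following along, step 2 can be done in Apache along these lines (a minimal sketch assuming mod_rewrite, with `example.com` standing in for the real domain; adjust for your server):

```
# 301 all requests for example.com to www.example.com
RewriteEngine On
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^/?(.*)$ http://www.example.com/$1 [R=301,L]
```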