I say they're fake based on the fact that the reverse-DNS does not resolve back to Google.
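For anyone who wants to automate that check, here is a rough Python sketch of the usual reverse-then-forward DNS test. The function name and the `reverse`/`forward` parameters are my own (hypothetical, there so the lookups can be stubbed out), and the two domain suffixes are an assumption about where genuine Googlebot hosts resolve. It handles IPv4 only, since `socket.gethostbyname_ex` does not do IPv6.

```python
import socket

# Assumption: genuine Googlebot hosts reverse-resolve under one of these domains.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_real_googlebot(ip, reverse=socket.gethostbyaddr,
                      forward=socket.gethostbyname_ex):
    """Return True only if ip reverse-resolves to a Google host name
    and that host name resolves forward to the same ip."""
    try:
        hostname = reverse(ip)[0]          # rDNS (PTR) lookup
    except OSError:
        return False                       # no rDNS record at all
    if not hostname.lower().rstrip(".").endswith(GOOGLE_SUFFIXES):
        return False                       # rDNS does not point to Google
    try:
        addresses = forward(hostname)[2]   # forward lookup to confirm
    except OSError:
        return False
    return ip in addresses
```

The forward lookup matters because whoever controls an IP block can set its PTR record to any name they like; only the forward record is under Google's control. In practice you would cache the results rather than do two DNS lookups per request.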
Among the requests that I've been able to get an rDNS lookup for, these requests have come from
At first I felt sure that these organizations' computers were part of a botnet, but looking over this month's list as summarized above, I'm beginning to suspect that there is an indirect relationship between the keywords/phrases on my site, and those that might be targeted by these organizations crawling with the Googlebot user-agent. I'm not sure how to put that more precisely, but let's just say that my site might come up in a "broad-match" search for their keywords, but almost never in an exact-match search. We wouldn't naturally link to each other, either -- it's just a partial overlap of our keyword spaces.
Also, it's not beyond possibility (because of their nature and apparent size) that these organizations might be running a Google Appliance to support internal intranet search capabilities, and that this appliance might have some Web crawling capability (not sure).
On the other hand, it might be some "competitive intelligence" software that's sold to organizations such as these and that spoofs Googlebot. (If that's the case, I hope Googlebot is a trademark and that Google goes after them for falsely "trading as" Google.)
The number of these requests seems to be on the rise, and I'm just wondering if anyone here has developed any more-solid information, or has any similar "gut feel" opinions on this subject.
Thanks,
Jim
Overall, total fake hits were consistent (and low), and most were one-off hits whose rDNS pointed to, e.g., U.S. telco ISPs, possibly people fiddling with their UA IDs.
That said, there were/are repeat professional offenders, all undeterred by 403s:
.closerlook.com
.live-servers.net
.unixbsd.info
.amazonaws.com
All fake Googlebot hits were GETs but for one HEAD from local.com. Fake UAs included:
Mozilla/5.0 (compatible; Googlebot/2.1; [google.com...]
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Mozilla/5.0 (compatible; Googlebot/2.1;[two blank spaces here]http://www.google.com/bot.html)
Mozilla/5.0 (Windows; U; Windows NT 6.0; en-GB; rv:1.9.0.14) Gecko/2009082707 Select all Googlebot/2.X (.NET CLR 3.5.30729)
Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.3) Gecko/20090824 Googlebot 2.1
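A cheap first-pass filter on UAs like these, sketched in Python. The function name and the three labels are hypothetical, and the canonical string is an assumption (it is the commonly seen Googlebot 2.1 UA; Google's other crawlers send different strings, so adjust to taste). Note that an exact match proves nothing by itself, since the string is trivially copied, so it only means "go do the DNS check".

```python
# Assumption: this is the only Googlebot UA string we expect to see.
CANONICAL_UA = ("Mozilla/5.0 (compatible; Googlebot/2.1; "
                "+http://www.google.com/bot.html)")


def classify_ua(ua):
    """Sort a User-Agent string into one of three rough buckets."""
    if "googlebot" not in ua.lower():
        return "not-googlebot"         # makes no Googlebot claim at all
    if ua == CANONICAL_UA:
        return "needs-dns-check"       # well-formed, but still unverified
    return "malformed-googlebot"       # claims Googlebot, wrong format
```

Anything in the "malformed-googlebot" bucket can be refused without spending a DNS lookup on it.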
HTH
Seems to be on the decline IMO with about 10-20/day perhaps.
When they come from data centers, it's probably a lame CGI proxy attempting a good old-fashioned proxy hijack to rank off your pages.
When they come from home IPs, it could still be a local hosted CGI proxy playing games but I'm more often thinking it's botnets attempting spam harvesting and scraping.
Add to the repeat offenders list:
cable.casema.nl
Google Appliance
That UA should only look like this:
gsa-crawler (Enterprise; GID01065; yourname@yourcompany.com)
[code.google.com...]
My log entries and request-filtering rules argue against these accesses coming from browsers with spoofed UAs; the various HTTP request headers are "all wrong" for any browser, but also wrong for a real Googlebot.
Thanks for the Google Appliance UA, iBill -- Now that I see it again, I remember thinking, "What, the General Services Administration has their own 'bot?" :)
Jim