Fake Googlebot requests on the rise

Botnet, competitive intelligence, or semi-legit?

         

jdMorgan

4:39 pm on Sep 19, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I'm getting lots of requests from what appear to be fake Googlebots.

I say they're fake based on the fact that the reverse-DNS does not resolve back to Google.
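For readers who want to automate that check, here is a minimal sketch of a forward-confirmed reverse-DNS test. The hostname suffixes and the function names are assumptions for illustration, not something stated in this thread, and `socket.gethostbyname_ex` only covers IPv4. The resolvers are passed in as parameters so the logic can be exercised without touching the network.

```python
import socket

# Assumption: genuine Googlebot hosts reverse-resolve under these domains.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_google_hostname(hostname):
    """True if the PTR hostname sits under one of the assumed Google domains."""
    host = hostname.rstrip(".").lower()
    return host.endswith(GOOGLE_SUFFIXES)


def verify_googlebot(ip, reverse=socket.gethostbyaddr,
                     forward=socket.gethostbyname_ex):
    """Reverse-resolve the IP, check the domain, then confirm the forward lookup."""
    try:
        hostname = reverse(ip)[0]          # PTR lookup
    except OSError:                        # no rDNS at all
        return False
    if not is_google_hostname(hostname):
        return False
    try:
        addresses = forward(hostname)[2]   # A-record lookup for that hostname
    except OSError:
        return False
    # The hostname must map back to the IP that made the request.
    return ip in addresses
```

The forward step matters because anyone who controls the rDNS for their own address block can publish a PTR record that merely looks like a Google hostname.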

Among the requests for which I've been able to get an rDNS lookup, the sources include:

  • A U.S. business DSL line
  • A large high-end-real-estate brokerage
  • A "family services" provider in a Midwest U.S. state
  • A shipping materials manufacturer (best guess, as there were several organizations with the same name)
  • A London child-education charity
  • Various health-care service providers

    At first I felt sure that these organizations' computers were part of a botnet, but looking over this month's list as summarized above, I'm beginning to suspect that there is an indirect relationship between the keywords/phrases on my site, and those that might be targeted by these organizations crawling with the Googlebot user-agent. I'm not sure how to put that more precisely, but let's just say that my site might come up in a "broad-match" search for their keywords, but almost never in an exact-match search. We wouldn't naturally link to each other, either -- it's just a partial overlap of our keyword spaces.

    Also, it's not beyond possibility (because of their nature and apparent size) that these organizations might be running a Google Appliance to support internal intranet search capabilities, and that this appliance might have some Web crawling capability (not sure).

    On the other hand, it might be some "competitive intelligence" software that's sold to organizations such as these and that spoofs Googlebot. (If that's the case, I hope Googlebot is a trademark and that Google goes after them for falsely "trading as" Google.)

    The number of these requests seems to be on the rise, and I'm just wondering if anyone here has developed any more-solid information, or has any similar "gut feel" opinions on this subject.

    Thanks,
    Jim

    Pfui

    11:55 pm on Sep 19, 2009 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



    Curiously, I also found a couple of health care provider-related fake Googlebots in recent months' access logs.

    Overall, total fake hits were consistent (and low), and most were one-off hits whose rDNS pointed to, e.g., U.S. telco ISPs; possibly people fiddling with their UA strings.

    That said, there were/are repeat professional offenders, all undeterred by 403s:

    .closerlook.com
    .live-servers.net
    .unixbsd.info
    .amazonaws.com

    All fake Googlebot hits were GETs but for one HEAD from local.com. Fake UAs included:

    Mozilla/5.0 (compatible; Googlebot/2.1; [google.com...]
    Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
    Mozilla/5.0 (compatible; Googlebot/2.1;[two blank spaces here]http://www.google.com/bot.html)
    Mozilla/5.0 (Windows; U; Windows NT 6.0; en-GB; rv:1.9.0.14) Gecko/2009082707 Googlebot/2.X (.NET CLR 3.5.30729)
    Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.3) Gecko/20090824 Googlebot 2.1
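Since the spoofed strings above vary in punctuation and spacing, a loose, case-insensitive match on the "Googlebot" token is one way to catch all of them before deciding whether the source IP is genuine. The sketch below is only an illustration; `claims_googlebot`, `classify`, and the label strings are hypothetical names, and `ip_is_verified` is assumed to come from a separate reverse-DNS check.

```python
import re

# Any UA mentioning "googlebot", whatever the surrounding format.
GOOGLEBOT_TOKEN = re.compile(r"googlebot", re.IGNORECASE)


def claims_googlebot(user_agent):
    """True if the User-Agent string presents itself as Googlebot."""
    return bool(GOOGLEBOT_TOKEN.search(user_agent or ""))


def classify(user_agent, ip_is_verified):
    """Label a request from its UA and the result of an IP verification step."""
    if not claims_googlebot(user_agent):
        return "other"
    return "googlebot" if ip_is_verified else "fake-googlebot"
```

A request labelled "fake-googlebot" could then be answered with a 403, as described above.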

    HTH

    incrediBILL

    12:39 am on Sep 20, 2009 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



    Jim,

    IMO it seems to be on the decline here, at perhaps 10-20 a day.

    When they come from data centers, it's probably a lame CGI proxy attempting a good old-fashioned proxy hijack to rank off your pages.

    When they come from home IPs, it could still be a locally hosted CGI proxy playing games, but more often I think it's botnets attempting spam harvesting and scraping.

    Add to the repeat offenders list:

    cable.casema.nl

    "Google Appliance"

    That UA should only look like this:
    gsa-crawler (Enterprise; GID01065; yourname@yourcompany.com)
    [code.google.com...]

    dstiles

    4:38 pm on Sep 20, 2009 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



    You're probably right, Bill, but I also think some of them may be a) wannabe SEOs checking competitors for google-cloaking or b) someone playing with Firefox UA rotations.

    jdMorgan

    5:28 pm on Sep 20, 2009 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member



    dstiles,

    My log entries and request filtering rules argue against these accesses coming from browsers with spoofed UAs; the various HTTP request headers are "all wrong" for any browser, but also wrong for a real Googlebot.

    Thanks for the Google Appliance UA, iBill -- Now that I see it again, I remember thinking, "What, the General Services Administration has their own 'bot?" :)

    Jim

    incrediBILL

    6:00 pm on Sep 20, 2009 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



    "a) wannabe SEOs checking competitors for google-cloaking or b) someone playing with Firefox UA rotations."

    Either of those situations gets booted off my server automatically.

    I love frustrating 'em all ;)