Why Does a Bot Pretend to Be GoogleBot When Probing XMLRPC?

Since it's a known point of attack why bother? Stupid cleverness? End run?

Webwork

12:26 pm on Feb 25, 2016 (gmt 0)

I don't understand the intelligence behind this, if there is any:

46.148.22.18 - - [25/Feb/2016:03:51:00 -0500] "POST /xmlrpc.php HTTP/1.1" 404 7059 "http://mywebsite.com/xmlrpc.php" "MJ12bot/v1.0.6 (http://majestic12.co.uk/bot.php?+)"
46.148.22.18 - - [25/Feb/2016:03:51:59 -0500] "POST /xmlrpc.php HTTP/1.1" 404 7059 "http://mywebsite.com/xmlrpc.php" "MJ12bot/v1.0.6 (http://majestic12.co.uk/bot.php?+)"
46.148.18.162 - - [25/Feb/2016:03:55:15 -0500] "POST /xmlrpc.php HTTP/1.1" 404 7059 "http://mywebsite.com/xmlrpc.php" "Googlebot-Image/1.0"
46.148.18.162 - - [25/Feb/2016:03:55:21 -0500] "POST /xmlrpc.php HTTP/1.1" 404 7059 "http://mywebsite.com/xmlrpc.php" "Googlebot-Image/1.0"

Clearly the bot is going after WordPress's xmlrpc.php file, which sets off the alarms, so why bother pretending to be MJ12 or Gbot?

To get past filters allowing those bots? Wouldn't one expect a person savvy enough to filter in such a way to also filter out / block xmlrpc attacks / sniffing?
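For anyone who wants to pull these requests out of a log automatically, here is a minimal Python sketch, assuming the log is in Apache's "combined" format as in the lines quoted above. The names `parse_line`, `claims_crawler_on_xmlrpc` and `CRAWLER_TOKENS` are made up for illustration, and the token list is only a guess at which crawler names are worth watching for.

```python
import re

# Apache "combined" format:
# ip ident user [time] "method path proto" status size "referer" "user-agent"
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Hypothetical list of crawler names a spoofer might borrow.
CRAWLER_TOKENS = ("googlebot", "mj12bot", "bingbot")


def parse_line(line):
    """Return the log fields as a dict, or None if the line does not match."""
    m = LOG_RE.match(line)
    return m.groupdict() if m else None


def claims_crawler_on_xmlrpc(entry):
    """True for a POST to xmlrpc.php whose user-agent names a known crawler."""
    agent = entry["agent"].lower()
    path = entry["path"].split("?", 1)[0]
    return (
        entry["method"] == "POST"
        and path.endswith("/xmlrpc.php")
        and any(token in agent for token in CRAWLER_TOKENS)
    )
```

A match only says the request claims to be a crawler while POSTing to xmlrpc.php. It does not by itself prove the user-agent is fake.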

whitespace

8:05 pm on Feb 25, 2016 (gmt 0)

To get past filters allowing those bots?

I imagine so.

Wouldn't one expect a person savvy enough to filter in such a way to also filter out / block xmlrpc attacks / sniffing?

Well... not necessarily.

Lest we forget, WordPress is omnipresent, used by the masses and their dog. Any small advantage could equal a large number of sites.

lucy24

8:30 pm on Feb 25, 2016 (gmt 0)

To get past filters allowing those bots?

How accurate is the spoofing? Fake googlebot is pointless, since Google crawls from known IPs and it doesn't take much savvy to block a "Googlebot" (I once met a GoogleBot, like that, just to make it easier) from a non-Google IP. But MJ12 is distributed. Do all other aspects of the request look legitimate? In particular, the real thing uses one highly distinctive header.
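A rough Python sketch of the "Googlebot from a non-Google IP" test, assuming you keep your own list of ranges you trust. The function name `is_fake_googlebot` is hypothetical, and the single range 66.249.64.0/19 is only a placeholder assumption; substitute whatever ranges you have verified yourself.

```python
import ipaddress

# Placeholder allow-list (an assumption): replace with ranges you have verified.
ALLOWED_NETS = [ipaddress.ip_network("66.249.64.0/19")]


def is_fake_googlebot(ip, user_agent, allowed=ALLOWED_NETS):
    """True if the user-agent claims Googlebot but the IP is outside the allow-list."""
    if "googlebot" not in user_agent.lower():
        return False  # makes no Googlebot claim, so nothing to check here
    addr = ipaddress.ip_address(ip)
    return not any(addr in net for net in allowed)
```

This does nothing for a distributed crawler such as MJ12, which has no fixed ranges to compare against.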

Webwork

1:46 pm on Feb 26, 2016 (gmt 0)

I am a toddler at this aspect of webmastering so your kindness is appreciated.

@ Lucy24 "Google crawls from known IPs" -> I've only gotten to the point of believing what G says: "We don't inform y'all about our IPs because that data constantly changes." Where, then, is a reliable list? (PM me if only to tell me it's a closely guarded secret.)

I'm not yet to the point of accurately/intelligently scrutinizing headers. I'm currently struggling (mightily, up to 2 AM the other night) with the syntax and operation of Apache's directives. Fun enough to make a grown man lay his head down at times.

lucy24

7:34 pm on Feb 26, 2016 (gmt 0)

We don't inform y'all about our IPs because that data constantly changes.

So they say, but has anyone out there met a bona fide googlebot crawling from a range that didn't resolve to "google" when you look it up?

My opinion: If a non-distributed crawler gets locked out due to showing up from an undocumented range, they have only themselves to blame.

(Mildly amusing corollary: Just a few days ago I discovered that on one site I'd been blocking not only fake googlebots but fake bingbots-- and then I must have forgotten all about it, possibly for several years, because all kinds of legitimate requests from new bing/msn IPs were getting blocked. I dealt with it by simply eliminating the lockout, since you don't meet enough fake bingbots to be worth the trouble. It was a next-to-no-traffic site, so I don't know if they held the sporadic 403s against me.)

whitespace

1:01 pm on Feb 27, 2016 (gmt 0)

So they say, but has anyone out there met a bona fide googlebot crawling from a range that didn't resolve to "google" when you look it up?

Exactly, you need to do a DNS lookup on the IP in order to validate it - they say. Although, once a Google IP; always a Google IP?

Some have reported that Google has crawled from non-US IPs (so for those who have implemented automatic Geo-IP redirection, Google will "more likely" see the correct content) - although I've not seen evidence of this myself (maybe there is something that triggers it)?

Andy Langton

2:49 pm on Feb 27, 2016 (gmt 0)

Some have reported that Google has crawled from non-US IPs (so for those who have implemented automatic Geo-IP redirection, Google will "more likely" see the correct content) - although I've not seen evidence of this myself (maybe there is something that triggers it)?

Locale-aware crawling occurs when Googlebot crawls with one or both of the following configurations:

Geo-distributed crawling: Googlebot appears to be using IP addresses based outside the USA, in addition to the longstanding IP addresses Googlebot uses that appear to be based in the USA.
Language-dependent crawling: Googlebot crawls with an Accept-Language field set in the HTTP header.

[support.google.com...]

lucy24

7:25 pm on Feb 27, 2016 (gmt 0)

But in each case, wouldn't the individual webmaster very quickly become aware of these non-US ranges* and add them to their fake-Googlebot exemption code? You can't just let people go around calling themselves the Googlebot.

:: wandering off to check logs for blocked Googlebot visits to a few specific non-English pages ::

Although, once a Google IP; always a Google IP?

I hope so, because otherwise it's pretty impossible to draw conclusions about log entries from 2013. All I can say is that most of my blocked "Googlebot" requests came from Brazil-- I have no Portuguese content-- or Russia-- ditto. Also a slew from 54.various, which tells its own story. The first thing I checked for was a specific page which exists in Italian because it gets a disproportionate number of Italian visitors. It's got the appropriate hreflang metas, but as far as I can tell it has never been crawled from anywhere but the usual 66.249.whatever-it-is. And it gets crawled every few days, so you'd think they would try...

Nothing region-specific, though.

* Got a vague notion I once found a Google IP allocated to somewhere in Alberta, Canada.

[edited by: lucy24 at 8:07 pm (utc) on Feb 27, 2016]

Andy Langton

7:58 pm on Feb 27, 2016 (gmt 0)

The reverse DNS check still works for overseas Googlebot, incidentally.
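For reference, the reverse DNS check discussed in this thread can be sketched in Python roughly as follows: look up the hostname for the IP, confirm it ends in a Google domain, then resolve that hostname forward and confirm it maps back to the same IP. The function name `is_verified_googlebot` and the suffix list are assumptions for illustration (the two suffixes follow Google's published verification guidance as I understand it), and `gethostbyname_ex` covers IPv4 only.

```python
import socket

# Assumed suffixes for genuine Googlebot hostnames.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def _reverse(ip):
    """Reverse (PTR) lookup: IP -> primary hostname."""
    return socket.gethostbyaddr(ip)[0]


def _forward(host):
    """Forward lookup: hostname -> list of IPv4 addresses."""
    return socket.gethostbyname_ex(host)[2]


def is_verified_googlebot(ip, reverse=_reverse, forward=_forward):
    """Forward-confirmed reverse DNS check; lookups are injectable for testing."""
    try:
        host = reverse(ip).rstrip(".").lower()
    except OSError:  # socket.herror / socket.gaierror: no usable PTR record
        return False
    if not host.endswith(GOOGLE_SUFFIXES):
        return False
    try:
        # The forward step matters: anyone can set a PTR record claiming googlebot.com.
        return ip in forward(host)
    except OSError:
        return False
```

Calling `is_verified_googlebot(ip)` on an address from your logs does two live DNS lookups, so it is better suited to occasional log review or a cached check than to running on every request.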