
PageFetcher-Google-CoOp

aristotle

3:05 pm on Mar 5, 2015 (gmt 0)

I noticed this one showing up on one of my sites:
Host: 66.249.88.244
/
Http Code: 200 Date: Mar 05 09:31:34 Http Version: HTTP/1.1 Size in Bytes: 48508
Referer: -
Agent: PageFetcher-Google-CoOp;(+http://www.google.com/coop/cse/cref)

From what I can find, this bot appears to be associated with Google Custom Search. I wonder if this means that somebody has included my site in the index for a custom search on their site.

wilderness

3:46 pm on Mar 5, 2015 (gmt 0)

!^66\.249\.(6[4-9]|[7][0-9])\.

The 88 Class C is outside the standard g-bot range.
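
On its own that's just the negated half of a condition; the usual way a pattern like that gets dropped into .htaccess is as one leg of a mod_rewrite pair, roughly along these lines (a sketch only -- the surrounding RewriteRule wiring is illustrative, not a quote of anyone's actual rules):

RewriteEngine on
# anything claiming to be Google...
RewriteCond %{HTTP_USER_AGENT} google [NC]
# ...that is NOT coming from the real googlebot block (66.249.64-79) gets a 403
RewriteCond %{REMOTE_ADDR} !^66\.249\.(6[4-9]|[7][0-9])\.
RewriteRule .* - [F]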

aristotle

3:59 pm on Mar 5, 2015 (gmt 0)

wilderness -
Do you mean that this isn't a legitimate bot? Because if you do a web search, you can find articles about it.

wilderness

4:23 pm on Mar 5, 2015 (gmt 0)

Correct!

Google's bots only come from the 64-79 range.

wilderness

4:27 pm on Mar 5, 2015 (gmt 0)

FWIW, the link in the UA merely takes you to a google account login page (i.e., a google customer) and NOT a valid google SE page.

aristotle

5:02 pm on Mar 5, 2015 (gmt 0)

Thanks wilderness
Faking a googlebot suggests a definite motivation to hide the identity and purpose of the perp.

lucy24

6:20 pm on Mar 5, 2015 (gmt 0)

!^66\.249\.(6[4-9]|[7][0-9])\.

or, as I put it, ... 6[4-9]|7\d
\d vs. [0-9] probably doesn't matter, apart from a three-byte difference in filesize, but wrapping a lone 7 in [7] is a needless character class -- an extra nanosecond every time htaccess evaluates it.

aristotle, I've been tripped up by this one too. Google has two adjacent but unrelated 66.249.blahblah ranges: googlebot at 64-79 and then other activities at 80-95. (And then, to make it even more fun, there's domaintools at 66.249.0.0/19.)

I wouldn't necessarily call it "faking a googlebot" though, because there's a mixed bag of activities, including but not limited to
:: shuffling papers ::
Preview, Translate, Snippet, new favicon (the one that calls itself FF6)

I currently block the range unless they are either using the Google name or sending the X-Forwarded-For header. ymmv; I don't mind translation. But it certainly is convenient that they happen to break at precisely 79:80 -- as opposed to some other /20 which would not come out to a multiple of 10 -- since it makes Regular Expressions a little easier to construct ;)
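
For anyone who wants the concrete version of that policy, it might look something like this in .htaccess (a sketch under the assumptions just described -- the exact rules in use aren't quoted here):

RewriteEngine on
# the 80-95 block: Google-owned, but not googlebot
RewriteCond %{REMOTE_ADDR} ^66\.249\.(8[0-9]|9[0-5])\.
# let it pass if the UA mentions Google...
RewriteCond %{HTTP_USER_AGENT} !google [NC]
# ...or if an X-Forwarded-For header identifies the real requester
RewriteCond %{HTTP:X-Forwarded-For} ^$
RewriteRule .* - [F]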

aristotle

6:52 pm on Mar 5, 2015 (gmt 0)

Thanks Lucy
I really don't worry much about cases like this unless it starts showing up often enough to annoy me. I'd never noticed this one before, so that's why I posted about it.

The IP lookup gives:
IP: 66.249.88.244
Hostname: google-proxy-66-249-88-244.google.com
ISP: Google
Organization: Google
Services: Suspected proxy server
Type: Corporate
Assignment: Static IP

I can remember some other threads where people were wondering about requests coming through Google proxies at IPs in this neighborhood.
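
For what it's worth, a hostname like that can be double-checked with the forward-confirmed reverse DNS routine Google documents for verifying its addresses: reverse-resolve the IP, make sure the name ends in google.com or googlebot.com, then forward-resolve that name and confirm it points back at the same IP. A rough PHP sketch, purely illustrative, using the address from the log above:

$ip = '66.249.88.244';
$host = gethostbyaddr($ip);   // e.g. google-proxy-66-249-88-244.google.com
$verified = preg_match('/\.(googlebot\.com|google\.com)$/', $host)
            && gethostbyname($host) === $ip;   // forward lookup must point back at the same IP
echo $verified ? "genuine Google address\n" : "hostname does not check out\n";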

lucy24

8:35 pm on Mar 5, 2015 (gmt 0)

Do you record headers? Google tends to be pretty decent about sending the x-forwarded-for header -- for example in Translate, or that phone-rendering thing whose name I've gone blank on. Then you can find out who's really behind the request.

aristotle

8:43 pm on Mar 5, 2015 (gmt 0)

Well I don't know how to record headers, or even if I can do it with the info that cPanel provides.

wilderness

8:50 pm on Mar 5, 2015 (gmt 0)

Do you record headers


Can I do that with my VCR or Blu-Ray :)

lucy24

10:15 pm on Mar 5, 2015 (gmt 0)

or even if I can do it with the info that cPanel provides

Nothing to do with cPanel; it's best done in your own HTML. Within your existing page footer-- assuming you've already got something, whether an SSI or a php include-- add something along these lines (this is my current version, but I think incrediBill created the original code):

// polyfill: getallheaders() isn't available under every PHP setup,
// so rebuild the request headers from $_SERVER when it's missing
if (!function_exists('getallheaders'))
  {
  function getallheaders()
    {
    $headers = array();
    foreach ($_SERVER as $name => $value)
      {
      if (substr($name, 0, 5) == 'HTTP_')
        { $headers[str_replace(' ', '-', ucwords(strtolower(str_replace('_', ' ', substr($name, 5)))))] = $value; }
      }
    return $headers;
    }
  }

// one log file per day, opened in append mode
$ip = $_SERVER['REMOTE_ADDR'];
$fh = fopen($_SERVER['DOCUMENT_ROOT'] . "/boilerplate/headers-" . date('Ymd') . ".log", "a");
fwrite($fh, date('Y-m-d:') . date("H:i:s\n"));
fwrite($fh, "IP: $ip\n");

// dump every request header, one per line
foreach (getallheaders() as $name => $value)
  {
  fwrite($fh, "$name: $value\n");
  }

fwrite($fh, "----\n\n");
fclose($fh);

I don't generally care for using code that I don't personally understand-- I speak about three words of php, and this function uses more than those three-- but it works as intended.

Note this bit:
fopen($_SERVER['DOCUMENT_ROOT'] . "/boilerplate/headers-". date('Ymd') . ".log","a")

That's where you set the name and location of the file that will store each day's headers. The "a" flag means "add to this file if it exists, and create one if it doesn't". I happen to have a directory called /boilerplate/ so I used that. You don't have to use the .log extension but it seemed most practical, since that's what your access and error logs use. (I've instructed my computer to open everything with .log extension in SubEthaEdit. I think it defaulted to Terminal, or at best TextEdit.)

This version works only on pages-- that is, anything that happens to have a footer. But I've included the code on my custom 403 page, so request headers from blocked robots get recorded along with all the others. In the rare case of a blocked image or pdf request, even those headers will get logged, because the server sends back the same 403 page regardless. This is sometimes useful if you're trying to figure out why the request was denied.

Unlike access/error logs, this is a file stored in your personal webspace. So it will stay there forever; I go in every few weeks, download to my HD, and delete all but the current day's file. What you do with the logs is up to you. I tend to keep them for a few months in case I think of something to check.
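
As for the hookup itself, whichever flavour of footer you already have, it's one line; the /boilerplate/footer.php name here is only a placeholder for whatever your own footer file happens to be called:

<!--#include virtual="/boilerplate/footer.php" -->

or, on a php page,

<?php include $_SERVER['DOCUMENT_ROOT'] . '/boilerplate/footer.php'; ?>

The header-logging code then lives inside that footer file, so every page that pulls in the footer records its headers.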


All this is just for your information. If you simply want to handle requests differently depending on where they "really" came from, you might have something like

RewriteCond %{HTTP_FORWARDED} ^11\.22\.33


... and then whatever you want the rule to do. If, instead of a number, you just say . or !. it means "If this header exists (or doesn't exist) at all".
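
One syntax footnote on that: %{HTTP_FORWARDED} reads the Forwarded header; if it's specifically the X-Forwarded-For header you're after (the one the Google services send), mod_rewrite reaches arbitrary headers with the %{HTTP:...} form. Same placeholder address, sketched out with a rule attached (illustrative only):

# requests whose X-Forwarded-For says they really came from 11.22.33.*
RewriteCond %{HTTP:X-Forwarded-For} ^11\.22\.33
# ...then whatever you want to do with them, e.g. refuse:
RewriteRule .* - [F]

The same . / !. trick works there for "header present" / "header absent".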

aristotle

11:50 pm on Mar 5, 2015 (gmt 0)

Thanks for taking so much of your time, Lucy. I didn't intend for you to do that.

But I prefer not to implement that kind of script at this time. All of the pages on my sites are static html. Also, almost all of them haven't been touched in a long time, and I don't like to make changes unless I'm forced to. I've got along without keeping any record of headers so far, and I don't enjoy time spent looking at logs, or working on .htaccess.

Thanks again for taking time to prepare that post.

keyplyr

2:04 am on Mar 6, 2015 (gmt 0)

I block "fetch" if found in the UA string. This stops Feedfetcher-Google and one of the Apple Siri bots (Fetcher/1.0.)

I don't publish a RSS and it's a mystery to me why Feedfetcher-Google continues to request images files & pdf, so I've always blocked it. So far I don't see any ill affects.

I don't like Apple anything. This goes way back to them wanting to charge me for developer license just to test a few of my ideas on their platform.

I allow only the Googlebot ranges (as wilderness mentioned) so all the proxies are blocked if "google" is a UA attribute.
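
Sketched out in .htaccess, a combination like that might look roughly as follows -- the exact directives aren't quoted in this thread, so treat it as illustrative:

RewriteEngine on

# anything with "fetch" in the UA (Feedfetcher-Google, Fetcher/1.0, ...) gets a 403
RewriteCond %{HTTP_USER_AGENT} fetch [NC]
RewriteRule .* - [F]

# "google" in the UA is honoured only from the real googlebot block, 66.249.64-79
RewriteCond %{HTTP_USER_AGENT} google [NC]
RewriteCond %{REMOTE_ADDR} !^66\.249\.(6[4-9]|7[0-9])\.
RewriteRule .* - [F]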