Google Tools

Forum Moderators: open

wilderness

1:29 am on Aug 1, 2008 (gmt 0)

Might anybody be aware of "what tools" and what IP ranges are utilized by "each tool" in the 66.249. Class C's 84 & 85?

wilderness

2:06 pm on Aug 1, 2008 (gmt 0)

It was not my desire to hijack the MSM thread, so I thought this alternative a better option.

Yesterday Samizdata provided the following:

you can try the Google Wireless Transcoder [google.com]:

In order to make this work on my websites, I had to remove three different lines (one for an IP and two for UA's).
In addition, the images were utilized from a 66.249.84.zz, rather than a 72.14.

The UA used for these image requests is also slightly different from the normal transcoder UA and I'm not sure that it wouldn't trip some whitelisting that many of us are using?
Anybody aware?

My reference data regarding the 84 & 85 Class C's (which I deny) are nil! (bad practice).

TIA

Don

Samizdata

3:36 pm on Aug 1, 2008 (gmt 0)

Might anybody be aware of "what tools" and what IP ranges are utilized by "each tool"

Some Google tools seem to use IP ranges interchangeably and cannot be isolated in this way.

The tools to consider are Wireless Transcoder, FeedFetcher, Sitemaps, Translator and Web Accelerator - plus the automated fetches (usually with a Linux/Firefox UA) that I have always assumed are "quality control" checks for deceptive cloaking.

Unlike some here I do not intercept any Google IPs, but as noted above it can be necessary to adjust .htaccess traps to make exceptions for their tools, and as incrediBILL has often pointed out there is potential for abuse in their proxy services. So informed choices are what we need.

The UA used for these image requests is also slightly different from the normal transcoder UA

Indeed - "Windows NT 5.0;Google Wireless Transcoder;" has no space after the semi-colon and may trip a check for a valid NT string unless an exception is made. Image requests can also use a different IP range to the associated HTML request.
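For anyone wanting to see the shape of such an exception, here is a minimal sketch in Python. It assumes a hypothetical strict rule that expects "; " or ")" after the Windows NT token; the regex and the function name are illustrative only and are not anyone's actual .htaccess rule.

```python
import re

# Hypothetical strict rule: "Windows NT x.y" must be followed by "; " or ")".
STRICT_NT = re.compile(r"Windows NT \d+\.\d+(?:; |\))")

# Substring quoted in this thread; note there is no space after the semicolons.
TRANSCODER_TOKEN = "Windows NT 5.0;Google Wireless Transcoder;"


def passes_nt_check(user_agent: str) -> bool:
    """Return True if the UA has no NT token, a well-formed one, or the transcoder token."""
    if "Windows NT" not in user_agent:
        return True  # nothing for this rule to validate
    if TRANSCODER_TOKEN in user_agent:
        return True  # explicit exception for the Wireless Transcoder
    return STRICT_NT.search(user_agent) is not None
```

The exception is tested before the strict pattern, so the transcoder string gets through while other malformed NT strings are still caught.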

Another exception I make is for Google-Sitemaps (Accept header), and the Web Accelerator also seems to have changed recently, though addressing it by IP always seemed a bad idea.

So I'm afraid there is no simple IP-based solution, and my method is to allow any Google-registered IP and deal with the various tools based on headers and user-agents, usually by making exceptions.
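As a rough illustration of sorting the tools by user-agent rather than by IP, a sketch in Python follows. The two tokens are the ones named in this thread; the labels and the function name are made up for the example, and a real setup would also look at headers such as Accept.

```python
from typing import Optional

# Substrings taken from user-agents named in this thread; the labels are arbitrary.
TOOL_TOKENS = {
    "Google Wireless Transcoder": "transcoder",
    "Google-Sitemaps": "sitemaps",
}


def classify_google_tool(user_agent: str) -> Optional[str]:
    """Return a short tool label based on the UA alone, or None if unrecognised."""
    for token, label in TOOL_TOKENS.items():
        if token in user_agent:
            return label
    return None
```

Anything that comes back as None is where the case-by-case exceptions would be decided.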

I concede that this is something of a Faustian pact, but Google are generally much better behaved (and more efficient) than any of their competitors. And they actually send human traffic.

In my experience, however, very few mobile devices use the Wireless Transcoder.

...

wilderness

3:49 pm on Aug 1, 2008 (gmt 0)

many thanks.

So I'm afraid there is no simple IP-based solution, and my method is to allow any Google-registered IP

I wish I could do this; however, due to my extensive RIPE restrictions, most translators are denied on my sites (a bad practice for most webmasters).

Same restrictions apply to Web Accelerators (not just Google), which are denied access to my sites as a result of my own desire to limit cache (also a bad practice for most webmasters).

In summary, I guess there is no simple method for me to deal with these mobile devices (removing restrictions); I'll simply need to monitor my logs and then remove the previous rewrites through trial and error.

Sigh!

Samizdata

4:00 pm on Aug 1, 2008 (gmt 0)

Same restrictions apply to Web Accelerators

To clarify, I don't allow prefetching by Google Web Accelerator, but I don't block it by IP address.

And I don't use the method given on Google's site, which does not seem to work.

...

Samizdata

4:25 pm on Aug 1, 2008 (gmt 0)

a simple method for me to deal with these mobile devices

For mobile devices in general I have evolved a fairly complicated method using PHP.

In .htaccess I have exceptions for Blackberries, PlayStations and WinWAP.

I also have several mobile proxies blocked as some are wide open and allow Googlebot to crawl.

Mobiles are a whole 'nuther subject.

...

dstiles

7:45 pm on Aug 1, 2008 (gmt 0)

In what way does prefetch not work as advertised, Samizdata? I seem to trap x_prefetch on the HTTP_X_MOZ for both google listing prefetches and through the accelerator proxies. Or am I only seeing a small portion of them?
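The header check described here can be sketched in Python as follows, assuming a CGI/WSGI-style environment in which the X-moz request header appears under the key HTTP_X_MOZ with the value "prefetch". The function name is hypothetical.

```python
def is_prefetch(environ: dict) -> bool:
    """Return True if the request environment carries an X-moz: prefetch header."""
    value = environ.get("HTTP_X_MOZ", "")
    # Compare case-insensitively and ignore stray whitespace around the value.
    return value.strip().lower() == "prefetch"
```

A request without the header is treated as an ordinary fetch, so this only catches prefetchers that announce themselves.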

Samizdata

8:16 pm on Aug 1, 2008 (gmt 0)

Earlier this year I installed the Google Web Accelerator and did some testing.

I found that it did not use a Google IP for prefetching, but used my own.

I tried intercepting it with the Google-provided method and failed every time.

The only method that worked for me is at [webmasterworld.com...]

I can't explain the discrepancy, but I can replicate it.

...

dstiles

10:14 pm on Aug 1, 2008 (gmt 0)

Ok Samizdata. Thanks. I'll follow up on that when I have a moment.

dstiles

10:50 pm on Aug 8, 2008 (gmt 0)

Further to this, I've trapped two IPs in the 66.249.84.* and 66.249.85.* ranges this week, five hits in total over the two IPs. Most were standard MSIE's but the latest one was:

Mozilla/4.0 (compatible; WebCapture 3.0; Windows)

Now that IS bad. Surely google wouldn't use a scraper UA?

Last month I got a visit from the google sitemap robot on the 85 block - genuine as I'd just submitted one.

Why do I get the idea they are running bots and proxies or accelerators on the same IP range?

wilderness

12:31 am on Aug 9, 2008 (gmt 0)

WebCapture is the PDF of a webpage and/or site.

dstiles

1:59 am on Aug 9, 2008 (gmt 0)

I'm fairly sure you're not talking about the popular file format here...?

As far as I can make out it's an image uploader and/or a web site ripper.

wilderness

3:10 am on Aug 9, 2008 (gmt 0)

I'm fairly sure you're not talking about the popular file format here...?

WebCapture is the name added to the UA when the full version of Acrobat (i.e., Adobe) is used along with the Print/Distiller function to create a PDF of a single page or an entire website.
The level of directory depth as well as outbound (other websites) is entirely configurable.

Samizdata

3:18 am on Aug 9, 2008 (gmt 0)

Surely google wouldn't use a scraper UA?

In recent months I have seen an identical IP in the 66.249.84.nn range used for:

Google-Sitemaps
Google Wireless Transcoder
Images fetched by the Wireless Transcoder (no UA)
A single JavaScript file (Firefox UA)
A single CSS file (Firefox UA)
A search done on MSN/Live (IE7 UA)
A search done with the bsalsa.com Embedded Web Browser (nonsense UA)
An inbound link from Wikipedia (infamous AVG 1813 UA)

As mentioned above I do not block any Google IPs, and I am unconcerned by what I assume to be "quality control" checks (JavaScript and CSS) as I accept that they have to do this occasionally.

The last two items, however, were intercepted because of their user-agents.

The MSN/Live search is intriguing.

...

dstiles

7:46 pm on Aug 9, 2008 (gmt 0)

Wilderness - thanks for the info. It's the full website bit that caught my eye elsewhere but without the acrobat reference. Banned, anyway. There are enough site-rippers around without making PDF's of sites.

Samizdata - an interesting list. As I mentioned, you can add MSIE varieties to that list. It would be worth passing through sitemaps in that IP range but I'd probably return a 403 for the others. I don't believe they are doing quality checks - not with all of that UA baggage. My take would be floating a proxy of some kind.
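The "pass sitemaps, 403 the rest" idea for those two ranges could look something like the Python sketch below. It assumes the ranges are exactly 66.249.84.0/24 and 66.249.85.0/24 and that the sitemap fetcher can be recognised by the Google-Sitemaps token in its UA; the function name and the decision to ignore all other IPs are choices made for the example.

```python
import ipaddress

# The two Class C ranges discussed in this thread.
WATCHED = (
    ipaddress.ip_network("66.249.84.0/24"),
    ipaddress.ip_network("66.249.85.0/24"),
)


def status_for(ip: str, user_agent: str) -> int:
    """Return 200 to pass the request through or 403 to refuse it."""
    addr = ipaddress.ip_address(ip)
    if not any(addr in net for net in WATCHED):
        return 200  # outside the two ranges: this sketch has no opinion
    if "Google-Sitemaps" in user_agent:
        return 200  # let the sitemap fetcher through
    return 403  # everything else from these ranges is refused
```

Note that a UA test like this is only as trustworthy as the IP it is paired with, since the token itself is easy to forge.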

OR... Could it be click-throughs from the Cache option on google listings?

Interesting observation re: blocking bsalsa. I've been blocking it for some time, again because of its advertised download capability. I was going to ask here about it a couple of days ago but ran out of time.

Are you sure the wiki one was AVG? If it was the "user-agent:" UA it could have been a bot or whatever, as mentioned in my new thread "User-Agent: Mozilla". Or it could have been an AVG forgery - I got multiple hits from one of those a couple of days ago.

Samizdata

9:22 pm on Aug 9, 2008 (gmt 0)

I don't believe they are doing quality checks

I have always assumed that when Mountain View fetches a single JavaScript or CSS file (which Googlebot is forbidden to do in my robots.txt) it is a way of checking that nothing untoward is going on - they have to do it somehow, after all, and I don't find it too intrusive.

Could it be click-throughs from the Cache

Not in my case.

Are you sure the wiki one was AVG?

I am sure it was the AVG user-agent, which was enough to get my "robots policy" file.

I would stress that the list I gave is for one IP only, and was meant in part to illustrate the impossibility of addressing the various Google tools by IP address (Don's original question).
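To check the same thing against your own logs, a small Python sketch that groups user-agents by IP is given below. It assumes the log has already been parsed into (ip, user_agent) pairs; the parsing step and the function name are not from this thread.

```python
from collections import defaultdict


def user_agents_by_ip(records):
    """Map each IP to the sorted list of distinct user-agents seen from it."""
    seen = defaultdict(set)
    for ip, user_agent in records:
        seen[ip].add(user_agent)
    return {ip: sorted(uas) for ip, uas in seen.items()}
```

An IP that maps to many unrelated user-agents is the pattern described above, and a sign that IP alone will not separate the tools.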

I haven't investigated bsalsa in any depth as it rarely turns up.

...
