Yesterday Samizdata provided the following:
you can try the Google Wireless Transcoder [google.com]:
In order to make this work on my websites, I had to remove three different lines (one for an IP and two for UA's).
In addition, the images were requested from a 66.249.84.zz address, rather than a 72.14. one.
The UA used for these image requests is also slightly different from the normal transcoder UA, and I'm not sure whether it would trip the whitelisting that many of us are using.
Anybody aware?
I have no reference data at all on the 84 & 85 Class C's (which I deny). Bad practice, I know.
TIA
Don
Might anybody be aware of "what tools" and what IP ranges are utilized by "each tool"?
Some Google tools seem to use IP ranges interchangeably and cannot be isolated in this way.
The tools to consider are Wireless Transcoder, FeedFetcher, Sitemaps, Translator and Web Accelerator - plus the automated fetches (usually with a Linux/Firefox UA) that I have always assumed are "quality control" checks for deceptive cloaking.
Unlike some here, I do not intercept any Google IPs, but as noted above it can be necessary to adjust .htaccess traps to make exceptions for their tools, and as incrediBILL has often pointed out there is potential for abuse in their proxy services. So informed choices are what we need.
The UA used for these image requests is also slightly different from the normal transcoder UA
Indeed - "Windows NT 5.0;Google Wireless Transcoder;" has no space after the semi-colon and may trip a check for a valid NT string unless an exception is made. Image requests can also use a different IP range to the associated HTML request.
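For anyone who wants to see the shape of such an exception, here is a minimal sketch in Python rather than .htaccess. The strict pattern (NT version followed by ")" or by a semicolon plus space) and the function name nt_token_ok are my own assumptions for illustration, not anybody's production rule; only the transcoder string itself comes from the UA quoted above.

```python
import re

# Hypothetical strict check: "Windows NT <major>.<minor>" must be followed
# by ")" or by "; " (semicolon plus space), as ordinary browsers send it.
STRICT_NT = re.compile(r"Windows NT \d+\.\d+(?=\)|; )")

# Marker taken from the transcoder UA quoted in this thread.
TRANSCODER_MARK = "Google Wireless Transcoder"


def nt_token_ok(user_agent: str) -> bool:
    """Return True if the UA passes the Windows NT sanity check."""
    if "Windows NT" not in user_agent:
        return True  # the check does not apply to this UA
    if TRANSCODER_MARK in user_agent:
        return True  # explicit exception for the unspaced transcoder form
    return STRICT_NT.search(user_agent) is not None
```

A UA substring is trivial to forge, so an exception like this is probably best paired with a check that the request really comes from a Google-registered address.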
Another exception I make is for Google-Sitemaps (Accept header), and the Web Accelerator also seems to have changed recently, though addressing it by IP always seemed a bad idea.
So I'm afraid there is no simple IP-based solution, and my method is to allow any Google-registered IP and deal with the various tools based on headers and user-agents, usually by making exceptions.
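Roughly, the "allow the IP, then sort by user-agent" approach looks like the sketch below. The two CIDR blocks are assumptions used for illustration (check current WHOIS data before relying on them), and the marker list and the names is_google_ip and classify are hypothetical.

```python
import ipaddress

# Assumed Google-registered blocks, for illustration only; verify via WHOIS.
GOOGLE_NETS = [
    ipaddress.ip_network("66.249.64.0/19"),
    ipaddress.ip_network("72.14.192.0/18"),
]

# Lower-cased UA substrings for tools named in this thread (hypothetical list).
TOOL_MARKERS = [
    ("google wireless transcoder", "transcoder"),
    ("google-sitemaps", "sitemaps"),
    ("feedfetcher", "feedfetcher"),
]


def is_google_ip(addr: str) -> bool:
    """True if addr falls inside one of the assumed Google blocks."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in GOOGLE_NETS)


def classify(addr: str, user_agent: str) -> str:
    """Label a request by tool, looking at the UA only after the IP passes."""
    if not is_google_ip(addr):
        return "not-google"
    ua = user_agent.lower()
    for marker, tool in TOOL_MARKERS:
        if marker in ua:
            return tool
    return "google-unidentified"
```

Requests that land in "google-unidentified" are the ones that need a decision (headers, referrer, what was fetched), since the IP alone says nothing about which tool sent them.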
I concede that this is something of a Faustian pact, but Google are generally much better behaved (and more efficient) than any of their competitors. And they actually send human traffic.
In my experience, however, very few mobile devices use the Wireless Transcoder.
...
So I'm afraid there is no simple IP-based solution, and my method is to allow any Google-registered IP
I wish I could do this; however, due to my extensive RIPE restrictions, most translators are denied on my sites (a bad practice for most webmasters).
Same restrictions apply to Web Accelerators (not just Google), which are denied access to my sites as a result of my own desire to limit cache (also a bad practice for most webmasters).
In summary, I guess there is no simple method for me to deal with these mobile devices (removing restrictions). I'll simply need to monitor my logs and then remove the previous rewrites through trial and error.
Sigh!
a simple method for me to deal with these mobile devices
For mobile devices in general I have evolved a fairly complicated method using PHP.
In .htaccess I have exceptions for Blackberries, PlayStations and WinWAP.
I also have several mobile proxies blocked as some are wide open and allow Googlebot to crawl.
Mobiles are a whole 'nuther subject.
...
I found that it did not use a Google IP for prefetching, but used my own.
I tried intercepting it with the Google-provided method and failed every time.
The only method that worked for me is at [webmasterworld.com...]
I can't explain the discrepancy, but I can replicate it.
...
Mozilla/4.0 (compatible; WebCapture 3.0; Windows)
Now that IS bad. Surely google wouldn't use a scraper UA?
Last month I got a visit from the google sitemap robot on the 85 block - genuine as I'd just submitted one.
Why do I get the idea they are running bots and proxies or accelerators on the same IP range?
I'm fairly sure you're not talking about the popular file format here...?
WebCapture is the name added to the UA when the full version of Acrobat (i.e., Adobe) is used along with the Print/Distiller function to create a PDF of a single page or an entire website.
The level of directory depth as well as outbound (other websites) is entirely configurable.
Surely google wouldn't use a scraper UA?
In recent months I have seen an identical IP in the 66.249.84.nn range used for:
Google-Sitemaps
Google Wireless Transcoder
Images fetched by the Wireless Transcoder (no UA)
A single JavaScript file (Firefox UA)
A single CSS file (Firefox UA)
A search done on MSN/Live (IE7 UA)
A search done with the bsalsa.com Embedded Web Browser (nonsense UA)
An inbound link from Wikipedia (infamous AVG 1813 UA)
As mentioned above I do not block any Google IPs, and I am unconcerned by what I assume to be "quality control" checks (JavaScript and CSS) as I accept that they have to do this occasionally.
The last two items, however, were intercepted because of their user-agents.
The MSN/Live search is intriguing.
...
Samizdata - an interesting list. As I mentioned, you can add MSIE varieties to that list. It would be worth passing through sitemaps in that IP range but I'd probably return a 403 for the others. I don't believe they are doing quality checks - not with all of that UA baggage. My take is that they are floating a proxy of some kind.
OR... Could it be click-throughs from the Cache option on google listings?
Interesting observation re: blocking bsalsa. I've been blocking it for some time, again because of its advertised download capability. I was going to ask here about it a couple of days ago but ran out of time.
Are you sure the wiki one was AVG? If it was the "user-agent:" UA it could have been a bot or whatever, as mentioned in my new thread "User-Agent: Mozilla". Or it could have been an AVG forgery - I got multiple hits from one of those a couple of days ago.
I don't believe they are doing quality checks
I have always assumed that when Mountain View fetches a single JavaScript or CSS file (which Googlebot is forbidden to do in my robots.txt) it is a way of checking that nothing untoward is going on - they have to do it somehow, after all, and I don't find it too intrusive.
Could it be click-throughs from the Cache
Not in my case.
Are you sure the wiki one was AVG?
I am sure it was the AVG user-agent, which was enough to get my "robots policy" file.
I would stress that the list I gave is for one IP only, and was meant in part to illustrate the impossibility of addressing the various Google tools by IP address (Don's original question).
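If you want to check your own logs for the same pattern, a small tally along these lines would do it. It is only a sketch: it assumes you have already parsed the log into (IP, user-agent) pairs, and the function name shared_ips is hypothetical.

```python
from collections import defaultdict


def shared_ips(requests):
    """Map each IP seen with more than one distinct UA to its sorted UA list.

    `requests` is an iterable of (ip, user_agent) pairs; an empty or
    missing user-agent is recorded as "(no UA)".
    """
    seen = defaultdict(set)
    for ip, user_agent in requests:
        seen[ip].add(user_agent or "(no UA)")
    return {ip: sorted(uas) for ip, uas in seen.items() if len(uas) > 1}
```

Any address that shows up in the result is one where an IP-only rule would treat several different tools (or visitors) identically.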
I haven't investigated bsalsa in any depth as it rarely turns up.
...