Google Wireless Transcoder (again)

from Google, but does not obey robots.txt

         

simplicity

7:27 pm on May 17, 2006 (gmt 0)

10+ Year Member



As was already reported in an earlier thread (now closed, or else I would have posted there)

[webmasterworld.com...]

there is a bot with user agent

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; Google Wireless Transcoder;)

that does not obey robots.txt so it fell into my spider trap. Still, its
IP address 216.239.58.136 does belong to Google,

[ws.arin.net...]

so this Google bot violates Google's own supposed robots.txt conformance?

jdMorgan

11:23 pm on May 17, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Since it's not a robot, it doesn't fetch robots.txt, and it does not need to obey it.

Many of these things are doing HTML to WAP translations for internet-enabled cell phones.

You also may need to make allowances in the spider trapping logic, because these things tend to prefetch most of the objects and links on a page, to compensate for the relatively slow load time of the phones.

Jim

simplicity

8:28 am on May 18, 2006 (gmt 0)

10+ Year Member



Thanks Jim, these are good and interesting points! I could not find
any documentation about Google Wireless Transcoder, so I had assumed
it was a bot of some kind, but I now think you are right.

Still, even though Google could formally claim that robots.txt
adherence is not required here, it would be recommended if Google
Wireless Transcoder followed robots.txt to suit all existing spider
traps. Otherwise, one has to find out more or less through trial
and error which IP addresses (of Google Wireless Transcoder) to
*not* trap, because anyone can fake the Google Wireless Transcoder
user agent. It is much more elegant and appropriate if this gets
solved on Google's side, so I hope they'll read this!

It seems that the Google Wireless Transcoder also does not use the
X-moz: prefetch header for which I already check, so in my opinion
Google Wireless Transcoder is currently still a pretty badly behaving
what-ever-you'd-call-it. It seems to just prefetch no matter what,
and on top of that not care about robots.txt.
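
A minimal sketch of the kind of header check mentioned above, in Python for illustration only. The function name `is_declared_prefetch` and the plain dict of request headers are assumptions, not anything taken from an actual spider trap. It only recognises clients that announce a prefetch with `X-moz: prefetch`, so a client that prefetches silently is not detected.

```python
def is_declared_prefetch(headers):
    """Return True if the request announces itself as a prefetch.

    `headers` is assumed to be a plain dict mapping request header
    names to values. Header names are compared case-insensitively.
    """
    for name, value in headers.items():
        if name.lower() == "x-moz" and value.strip().lower() == "prefetch":
            return True
    return False
```

A trap script could skip the ban when this returns True and fall through to its normal handling otherwise.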

simplicity

9:15 am on May 18, 2006 (gmt 0)

10+ Year Member



Well, as a crude workaround and to avoid handling every offending
IP address, I have for the time being decided to white-list blocks
of IP addresses belonging to Google and a few others. "Bad bots"
will still be blocked and "stupid bots" will only be automatically
reported such that I can keep track of the robots.txt violations.
The IP address(es) for Google Wireless Transcoder should now pass,
and just get flagged for tracking purposes.
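
The workaround described above could look roughly like the following Python sketch. The name `classify_trap_hit` and the two CIDR blocks are assumptions: the blocks were picked only because they cover the two Google addresses quoted in this thread, and the real allocations should be looked up via whois before relying on them.

```python
import ipaddress

# Placeholder blocks (assumption): chosen only to cover the addresses
# quoted in this thread. Verify the real allocations via whois.
WHITELISTED_BLOCKS = [
    ipaddress.ip_network("216.239.32.0/19"),
    ipaddress.ip_network("64.233.160.0/19"),
]


def classify_trap_hit(ip):
    """Decide what to do with a client that requested the trap URL.

    Returns "flag" (log the hit, do not ban) for whitelisted blocks
    and "block" for everything else.
    """
    addr = ipaddress.ip_address(ip)
    if any(addr in block for block in WHITELISTED_BLOCKS):
        return "flag"
    return "block"
```

Matching on the IP block rather than on the user agent matters here, because anyone can send the Google Wireless Transcoder user agent string.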

Pfui

9:46 pm on May 19, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Ironically, I recently started blacklisting all Google IPs because of constant abuses by who-knows-who/what running through them. From robots.txt calls by Firefox users, to tinyurls to protected pages, to 14 Transcoder hits to the exact same page over two hours using two different G IPs, to hundreds of favicon requests every day, day after day, from each Toolbar and Desktop user, to blacklisted folks using G's IPs as personal proxies --

Enough.

Googlebot from googlebot.com? Cool. Anyone/thing else from G's IPs? A text 403 with my e-mail address.
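
One way to tell "Googlebot from googlebot.com" apart from other traffic on Google's IPs is a forward-confirmed reverse DNS lookup. The Python sketch below is an assumption about how such a check could be written, not a description of the setup used in this thread. The function name `is_verified_googlebot` and its `reverse` and `forward` parameters are hypothetical; they default to the standard library resolvers so they can be replaced when testing offline.

```python
import socket


def is_verified_googlebot(ip, reverse=socket.gethostbyaddr,
                          forward=socket.gethostbyname_ex):
    """Forward-confirmed reverse DNS check for a host under googlebot.com.

    The IP must resolve to a name ending in .googlebot.com, and that
    name must resolve back to the same IP. Lookup failures count as
    "not verified".
    """
    try:
        hostname = reverse(ip)[0]
    except OSError:
        return False
    if not hostname.lower().rstrip(".").endswith(".googlebot.com"):
        return False
    try:
        addresses = forward(hostname)[2]
    except OSError:
        return False
    return ip in addresses
```

The forward lookup is what stops someone who controls their own reverse DNS from simply naming a host something.googlebot.com. Sending the 403 itself is left to the server or script that calls this.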

simplicity

7:31 pm on May 21, 2006 (gmt 0)

10+ Year Member



Right Pfui, this scares me as a likely future trend: bots using
the Google tools as "proxies" to hide their identities. This not
only applies to the Google Wireless Transcoder (for which Google
could at least add robots.txt conformance), but it applies also
to the core of Google, their search engine. Bots only need to
generate sets of keywords and Google returns the top content
pages right away, so the bots just need to harvest those pages
without any need to follow internal website links. Still, 90%
or so of my incoming traffic comes through Google searches, so
I clearly do not want to block all Google IP addresses.

Just as with e-mail, where the volume of spam is now
larger than that of legitimate e-mail traffic, we may in a few
years be in a situation where the majority of website traffic
is from bots pretending to do legitimate Google searches. It
is again Google that can of course cut down the number of
searches that any particular IP address can do, but it gets
harder with large zombie networks with many thousands of IP
addresses running fake searches via Google to harvest the best
content pages on popular keywords. How can a simple webmaster
stop this without blocking (or discouraging via some log-in)
most legitimate traffic at the same time?

wilderness

8:10 pm on May 21, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



How can a simple webmaster stop this without blocking (or discouraging via some log-in) most legitimate traffic at the same time?

<META NAME="GOOGLEBOT" CONTENT="NOARCHIVE">

<META NAME="robots" CONTENT="noarchive">

Pfui

8:11 pm on May 21, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



simplicity, not to fret. You can allow legit visitors searching via 'regular' google.com and still deny iffy hits via Google's IPs, because when you deny the IPs by number, google.com (and googlebot.com) are still allowed.

Here are two examples. The first is someone using a legit Host/IP (X'd out here) finding my home page (/) via referer google.com. The second is someone searching through a Google IP (note the robots.txt), with no referer. The latter is suspect as heck, thus the 403.

OKAY:

XXX.XX.XXX.XXX
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)
Date Page Status Referer
05/21 12:05:31 / 200 [google.com...]

NOT OKAY:

64.233.172.18
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.3) Gecko/20060426 Firefox/1.5.0.3
Date Page Status Referer
05/08 17:25:13 /robots.txt 403 -

Does that distinction help?

.
P.S. to wilderness
The tags don't prevent misuse/abuse of G's IPs as proxies.

simplicity

8:07 am on May 22, 2006 (gmt 0)

10+ Year Member



Thanks wilderness and Pfui.

wilderness: I know I can prevent Google caching, and I am seriously considering
it now because this morning I found that Google Wireless Transcoder behaves much
like a scraper. I got a referer URL at [google.com...] that linked
to a GWT cache page that contained all my page content, with images, but minus the
adsense ads! It was fully cached because I accessed this GWT page from behind
a proxy that (deliberately) blocks my own site. GWT also appears to strip out any
Javascript that the normal Google search engine cache still preserves and that
normally throws visitors back to my original site. It is all getting rather ugly
with GWT. One peculiarity was also that the above referer URL got the visitor to
fall into my spider trap, even though Google's GWT page did not (perhaps not yet,
in view of caching lag) contain the trapping URL, but the page at the referer URL
did show a text link to my original page that does contain the trap. So it looks
like GWT was used to find "interesting pages" which were then harvested outside
the GWT cache, i.e. from my original website. In this case the bot (?) then still
followed further links and fell into my spider trap, but it could just as well
have only used the trapless GWT cache.

Pfui: I need to think a bit more about what you said. I was thinking of the
situation where the bot behaves much like a real user, entering keywords into
Google to find the best pages on any topic and then just loading these single
pages without following further links, such that robots.txt will play no role
(Google just handled that part while indexing). With sufficient variation in
IP addresses (large zombie nets) that will be hard to track down as a pattern
both for individual webmasters and for Google, while the accumulated top
page content harvested by the zombie net would be quality scraper site feed.
In other words, I'm not so sure that I can reliably distinguish "legit visitors
searching via 'regular' google.com" from bots doing the same. I don't think it
is a big deal right now, but it could be in a few years. So here it is not a
matter of using Google tools as a real proxy but just using Google to get to
the highest quality and most popular page content within the constraints imposed
by robots.txt, and then fetching that selected content directly with a large
variation in IP addresses through a zombie net - the same zombie net that is
used to ask Google many queries without Google noticing that it is under the
control of one entity - the zombie net owner spreading out Google queries for
top quality content that might be used for or sold to scraper/MFA sites.

wilderness

2:31 pm on May 22, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I know I can prevent Google caching, and I am seriously considering it now

simplicity,
Many folks "stick to their guns" that accurate visitor stats are not possible because of providers caching pages.

Although setting up pages for NO CACHE is against the premise of the internet, it's a damn useful tool for webmasters who have an interest in accurate stats and identifying visitors.

Of course, not every robot may be eliminated from cache, nor, may this process be implemented overnight.

In the end, and IMO, it's a very useful tactic.

BTW, the majority of my own pages are very long text/articles and these phone type tools have limits on page sizes. I haven't had much success from the beginning with these types of browsers and as a result have most of them denied.
It's a personal choice and likely not for every webmaster.

2nd BTW, there is a crawler that uses "GWT" as its UA, which I've had denied for some time.

Don

simplicity

2:51 pm on May 22, 2006 (gmt 0)

10+ Year Member



I was getting confused with Google Wireless Transcoder.
It looked like a hybrid between a content-transforming
proxy and a corresponding caching mechanism that keeps
the transformed pages for some undetermined period.

I checked and I could readily access one of my pages
that has had

<META NAME="robots" CONTENT="noarchive">

for a long time, by just entering its URL into

[google.com...]

and from behind a proxy that blocks my direct URL,
so one might conclude that GWT acts as a proxy itself,
or else Google would be violating the "noarchive".

However, I had also noticed that GWT would serve pages
that did not yet have the changes (including spider
trap links) that I added in the last few days, so it
seemed to really cache things too for at least a number
of hours or days.

I can now block entry points via www.google.com/gwt/n
using

RewriteEngine on
# Return 403 Forbidden when the User-Agent contains "Google Wireless Transcoder"
RewriteCond %{HTTP_USER_AGENT} Google\ Wireless\ Transcoder
RewriteRule .* - [F]

Via another route I could at first still get to a
cached version of the same page at GWT, but a few
hours later that link too gave a proper 403, so
maybe the question now is how long or what events it
takes before a cached GWT page expires - possibly a
cache coherence check is first triggered by a URL
access, which might explain why I first got pages
with content that was at least a few days old. I can
understand now why Pfui is blocking a lot of this
Google stuff. It is difficult to really oversee the
consequences of the alternative GWT entry points into
one's website in relation to legitimate visitors and
possible (ab)uses by bad bots. Reverse engineering
this through trial and error takes a lot of time.

P.S. Thanks Don, your comments came in after I wrote
the above. I'm now denying the GWT UA too, at least
for the time being and until my insights change.