[webmasterworld.com...]
There is a bot with the user agent
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; Google Wireless Transcoder;)
that does not obey robots.txt, so it fell into my spider trap. Still, its
IP address, 216.239.58.136, does belong to Google,
[ws.arin.net...]
so does this Google bot violate Google's own supposed robots.txt conformance?
Many of these things are doing HTML to WAP translations for internet-enabled cell phones.
You also may need to make allowances in the spider trapping logic, because these things tend to prefetch most of the objects and links on a page, to compensate for the relatively slow load time of the phones.
Jim
Still, even though Google could formally claim that robots.txt
adherence is not required here, it would be recommended if Google
Wireless Transcoder followed robots.txt to suit all existing spider
traps. Otherwise, one has to find out more or less through trial
and error which IP addresses (of Google Wireless Transcoder) to
*not* trap, because anyone can fake the Google Wireless Transcoder
user agent. It is much more elegant and appropriate if this gets
solved on Google's side, so I hope they'll read this!
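One way to avoid trial and error with IP addresses is a forward-confirmed reverse DNS check: look up the host name for the visiting IP, see whether it ends in a domain you trust, then resolve that host name again and confirm it points back to the same IP. Below is a minimal Python sketch of that idea. The suffix list and the function name is_verified_google_ip are my own assumptions, and I have not confirmed that Google Wireless Transcoder addresses reverse-resolve to such names, so test against your own logs first.

```python
import socket

# Assumption: host name suffixes to treat as "really Google".
# Adjust this to whatever your own reverse lookups actually show.
TRUSTED_SUFFIXES = (".google.com", ".googlebot.com")


def is_verified_google_ip(ip, reverse_lookup=None, forward_lookup=None,
                          suffixes=TRUSTED_SUFFIXES):
    """Forward-confirmed reverse DNS check for a single IP address.

    reverse_lookup(ip) must return a host name; forward_lookup(host) must
    return a list of IP strings. Both default to the system resolver and
    can be replaced by stubs for offline testing.
    """
    if reverse_lookup is None:
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        forward_lookup = lambda host: socket.gethostbyname_ex(host)[2]

    try:
        host = reverse_lookup(ip).lower().rstrip(".")
    except OSError:
        return False  # no reverse DNS entry at all

    if not host.endswith(suffixes):
        return False  # reverse name is not in a trusted domain

    try:
        # The reverse name alone can be forged; the forward lookup cannot
        # be forged by whoever controls the visiting IP.
        return ip in forward_lookup(host)
    except OSError:
        return False
```

A faked user agent does not pass this check, because the test depends only on DNS and not on anything the client sends. The lookups are slow, so in practice you would cache the result per IP.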
It seems that the Google Wireless Transcoder also does not use the
X-moz: prefetch header for which I already check, so in my opinion
Google Wireless Transcoder is currently still a pretty badly behaving
what-ever-you'd-call-it. It seems to just prefetch no matter what,
and on top of that not care about robots.txt.
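For anyone who wants to add the same prefetch check to their own trap, here is a rough Python sketch of the logic. It only helps with clients that announce themselves with the X-moz: prefetch header, which, as noted above, this transcoder does not seem to send. The function names and the /trap/ path in the usage note are made up for illustration.

```python
def is_declared_prefetch(headers):
    """Return True if the request carries an 'X-moz: prefetch' header.

    headers is a plain dict of request header names to values. Header
    names are compared case-insensitively.
    """
    for name, value in headers.items():
        if name.lower() == "x-moz" and value.strip().lower() == "prefetch":
            return True
    return False


def should_trap(path, headers, trap_paths):
    """Trap a request only if it hits a trap URL and is not a declared prefetch."""
    return path in trap_paths and not is_declared_prefetch(headers)
```

With trap_paths set to {"/trap/"}, a request for /trap/ with X-moz: prefetch is let through, while the same request without that header is trapped.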
Enough.
Googlebot from googlebot.com? Cool. Anyone/thing else from G's IPs? A text 403 with my e-mail address.
Just as with spam, where the volume of e-mail traffic is now
larger than that of legitimate e-mail traffic, we may in a few
years be in a situation where the majority of website traffic
is from bots pretending to do legitimate Google searches. It
is again Google that can of course cut down the number of
searches that any particular IP address can do, but it gets
harder with large zombie networks with many thousands of IP
addresses running fake searches via Google to harvest the best
content pages on popular keywords. How can a simple webmaster
stop this without blocking (or discouraging via some log-in)
most legitimate traffic at the same time?
Here are two examples. The first is someone using a legit Host/IP (X'd out here) finding my home page (/) via referer google.com. The second is someone searching through a Google IP (note the robots.txt), with no referer. The latter is suspect as heck, thus the 403.
OKAY:
XXX.XX.XXX.XXX
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)
Date            Page         Status  Referer
05/21 12:05:31  /            200     [google.com...]
NOT OKAY:
64.233.172.18
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.3) Gecko/20060426 Firefox/1.5.0.3
Date            Page         Status  Referer
05/08 17:25:13  /robots.txt  403     -
Does that distinction help?
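In case it helps, the rule behind those two examples can be sketched in a few lines of Python. This is only an approximation of the policy described above: the network list is something you have to supply yourself (here I would just start with the addresses seen in the logs), and the is_verified_googlebot flag stands in for whatever check you use to recognize the real Googlebot. Both names are hypothetical.

```python
import ipaddress


def decide_status(ip, google_networks, is_verified_googlebot=False):
    """Return the HTTP status to serve for a request from the given IP.

    google_networks is a list of CIDR strings that you consider Google's.
    Anything from those networks that is not the verified Googlebot gets
    a 403; everything else is served normally.
    """
    addr = ipaddress.ip_address(ip)
    from_google = any(addr in ipaddress.ip_network(net)
                      for net in google_networks)
    if from_google and not is_verified_googlebot:
        return 403
    return 200
```

For example, with google_networks set to ["64.233.172.18/32", "216.239.58.136/32"], the second log entry above gets a 403, while a visitor from an ordinary ISP address gets a 200 regardless of referer.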
.
P.S. to wilderness
The tags don't prevent misuse/abuse of G's IPs as proxies.
wilderness: I know I can prevent Google caching, and I am seriously considering
it now because this morning I found that Google Wireless Transcoder behaves much
like a scraper. I got a referer URL at [google.com...] that linked
to a GWT cache page containing all my page content, with images, but minus the
AdSense ads! I know it was fully cached because I accessed this GWT page from behind
a proxy that (deliberately) blocks my own site. GWT also appears to strip out any
JavaScript that the normal Google search engine cache still preserves and that
normally throws visitors back to my original site. It is all getting rather ugly
with GWT. One peculiarity was also that the above referer URL got the visitor to
fall into my spider trap, even though Google's GWT page did not (perhaps not yet,
in view of caching lag) contain the trapping URL, but the page at the referer URL
did show a text link to my original page that does contain the trap. So it looks
like GWT was used to find "interesting pages" which were then harvested outside
the GWT cache, i.e. from my original website. In this case the bot (?) then still
followed further links and fell into my spider trap, but it could just as well
have only used the trapless GWT cache.
Pfui: I need to think a bit more about what you said. I was thinking of the
situation where the bot behaves much like a real user, entering keywords into
Google to find the best pages on any topic and then just loading these single
pages without following further links, such that robots.txt will play no role
(Google just handled that part while indexing). With sufficient variation in
IP addresses (large zombie nets) that will be hard to track down as a pattern
both for individual webmasters and for Google, while the accumulated top
page content harvested by the zombie net would be quality scraper site feed.
In other words, I'm not so sure that I can reliably distinguish "legit visitors
searching via 'regular' google.com" from bots doing the same. I don't think it
is a big deal right now, but it could be in a few years. So here it is not a
matter of using Google tools as a real proxy but just using Google to get to
the highest quality and most popular page content within the constraints imposed
by robots.txt, and then fetching that selected content directly with a large
variation in IP addresses through a zombie net - the same zombie net that is
used to ask Google many queries without Google noticing that it is under the
control of one entity - the zombie net owner spreading out Google queries for
top quality content that might be used for or sold to scraper/MFA sites.
"I know I can prevent Google caching, and I am seriously considering it now"
simplicity,
Many folks "stick to their guns" that accurate visitor stats are not possible because of providers caching pages.
Although setting up pages for NO CACHE is against the premise of the internet, it's a damn useful tool for webmasters who have an interest in accurate stats and identifying visitors.
Of course, not every robot can be kept out of the cache, nor can this process be implemented overnight.
In the end, and IMO, it's a very useful tactic.
BTW, the majority of my own pages are very long text/articles and these phone type tools have limits on page sizes. I haven't had much success from the beginning with these types of browsers and as a result have most of them denied.
It's a personal choice and likely not for every webmaster.
2nd BTW, there is a crawler that uses "GWT" as its UA that I've had denied for some time.
Don
I checked and I could readily access one of my pages
that has had
<META NAME="robots" CONTENT="noarchive">
for a long time, by just entering its URL into
[google.com...]
and from behind a proxy that blocks my direct URL,
so one might conclude that GWT acts as a proxy itself,
or else Google would be violating the "noarchive".
However, I had also noticed that GWT would serve pages
that did not yet have the changes (including spider
trap links) that I added in the last few days, so it
seemed to really cache things too for at least a number
of hours or days.
I can now block entry points via www.google.com/gwt/n
using
# Answer 403 Forbidden to any request whose user agent
# contains "Google Wireless Transcoder"
RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} Google\ Wireless\ Transcoder
RewriteRule .* - [F]
Via another route I could at first still get to a
cached version of the same page at GWT, but a few
hours later that link too gave a proper 403, so
maybe the question now is how long or what events it
takes before a cached GWT page expires - possibly a
cache coherence check is first triggered by a URL
access, which might explain why I first got pages
with content that was at least a few days old. I can
understand now why Pfui is blocking a lot of this
Google stuff. It is difficult to really oversee the
consequences of the alternative GWT entry points into
one's website in relation to legitimate visitors and
possible (ab)uses by bad bots. Reverse engineering
this through trial and error takes a lot of time.
P.S. Thanks Don, your comments came in after I wrote
the above. I'm now denying the GWT UA too, at least
for the time being and until my insights change.