Once this month. It got fed a 403 from a previously trapped unwelcome UA on the same IP and went sulking. I may or may not release the IP but the bot is already in the trap list.
Date: 08/May/2009 08:47:08
UA: AppEngine-Google; (+http://code.google.com/appengine)
The subdomain was for an Apache log tool.
It's been around for a while.* From my notes:
April, 2008: 184.108.40.206
June, 2008: 220.127.116.11
July, 2008: 18.104.22.168
I sent it packing way back when. A quick skim of its then-activity:
Hits all file types, including favicons and .js files, but referers switch too rapidly -- e.g., in two seconds -- to be real people in real time. Where referers exist, they were all Google-related/hosted:
*For more info, search this site for: AppEngine-Google
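The "referers switch too rapidly" observation can be checked mechanically. Here is a rough sketch, not the method actually used above: it assumes log hits have already been parsed into `(timestamp, ip, referer)` tuples, and the function name, the two-second window and the three-referer threshold are all made up for illustration.

```python
from datetime import datetime, timedelta


def rapid_referer_switches(hits, window_seconds=2, min_referers=3):
    """Return the set of IPs that sent at least `min_referers` distinct
    referers within any span of `window_seconds`.

    `hits` is an iterable of (timestamp, ip, referer) tuples; the tuple
    layout and both thresholds are assumptions made for this sketch.
    """
    by_ip = {}
    for ts, ip, referer in hits:
        if referer:  # hits without a referer tell us nothing here
            by_ip.setdefault(ip, []).append((ts, referer))

    window = timedelta(seconds=window_seconds)
    flagged = set()
    for ip, entries in by_ip.items():
        entries.sort(key=lambda e: e[0])
        start = 0
        for end in range(len(entries)):
            # Shrink the window from the left until it spans <= window_seconds.
            while entries[end][0] - entries[start][0] > window:
                start += 1
            distinct = {ref for _, ref in entries[start:end + 1]}
            if len(distinct) >= min_referers:
                flagged.add(ip)
                break
    return flagged
```

A flagged IP is only a hint that something other than a person is driving the requests; prefetchers and proxies can produce similar patterns, so treat the output as a list to review rather than a block list.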
I've been seeing it for a while also. Not a bot, so no robots.txt. I've allowed it (as with all other Google UAs) on a whitelist of authenticated IPs.
After seeing many iffy UAs and at least one bot visit via translate.google.com (or the gadget/toolbar [translate.google.com.tr]), I decided to err on the restrictive side with its kin. Last month:
Mozilla/5.0 (SnapPreviewBot) Gecko/20061206 Firefox/22.214.171.124,gzip(gfe) (via translate.google.com)
Also: A Script is mirroring my clients site with google translate? [webmasterworld.com]
They are using the translators as convenient scraping services. They can translate back and forth and automatically generate "new" content.
Another service I've seen them use in the same way is the W3C validator, so I had to block that too.
So far they are easy to block, as these IPs do not resolve properly via rDNS.
Pfui, that thread you linked to was a real eye-opener for me.
You all are a never-ending font of knowledge for me. I'm most appreciative. Thanks.
Gary, thanks back atcha! I constantly learn from you and everyone here. Now if only I knew as much as y'all combined:)
On the translator/webapp topic -- I have a love-hate thing going on. I see legit people using Google's and others' translators, but the risks are real. Plus I deny all non-html files to Google's IPs so the pages that translation and/or any webapp users see would be really wacky.
So am I correct in understanding that Google App Engine is comparable to Amazon's cloud computing, in that you never know who's really behind the IP address?
Based on what I've seen I'm not sure how to treat it.
The website it hit belongs to an author whose site I host. It did a GET on the default root page only.
The referrer is from a site with a bazillion links on the page, none of which point to the above website. Instead, each of the links points to a different sub-domain on the parent domain, each of which also has a bazillion links on it. None of the linked sites appears to serve any information of value. They're just full of links to other sites on sub-domains of the parent domain. Is this what's called a link farm? I always thought those linked to sites on other domains, not sub-domains like these. But I'm hardly an expert on this kind of thing.
So for now I'm just watching it to see if it returns, and if so what it does and what additional info it gives me.
I just found something seemingly stupid coming from a Google IP range: 74.125.75.xx. The UA says it is the Google Wireless Transcoder and Apple WebKit.
Maybe an iPod? It tried to grab media directly from outside my webpage. If I'm not letting Google Images do this, or anyone else via the leech setting, then what is this dork doing?
|After seeing many iffy UAs and at least one bot visit via translate.google.com (or the gadget/toolbar) |
Most people blindly allow anything coming from Google IPs.
I only allow Google IPs that are actually the Googlebot, or other plainly identified crawler from Google that I allow.
Everything else is treated like a data center and blocked simply because of the rampant abuses.
If you want to allow Google's translator, you can filter it more narrowly by using the FORWARD proxy field and tracking access per IP being forwarded via Google. That gives you a little more room to play than just blocking it altogether.
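As a rough illustration of tracking access per forwarded IP, here is a minimal sketch. It assumes the "FORWARD proxy field" means the `X-Forwarded-For` request header and that the leftmost entry is the end user; the class name, the limits and the plain-dict header lookup are all hypothetical. The header is supplied by the proxy and can be forged, so this only makes sense for requests already known to come from Google's IPs.

```python
import time
from collections import defaultdict, deque


class ForwardedClientLimiter:
    """Sliding-window hit counter keyed on the forwarded client IP.

    Assumption: the proxy passes the end user's address as the first
    entry of an "X-Forwarded-For" header (exact key spelling assumed).
    """

    def __init__(self, max_hits=30, window_seconds=60):
        self.max_hits = max_hits
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client IP -> recent hit timestamps

    @staticmethod
    def forwarded_client(headers):
        value = headers.get("X-Forwarded-For", "")
        first = value.split(",")[0].strip()
        return first or None

    def allow(self, headers, now=None):
        """Return True if this proxied request is within the per-client limit."""
        now = time.time() if now is None else now
        client = self.forwarded_client(headers)
        if client is None:
            return False  # nothing to track against: treat as plain data-center traffic
        hits = self._hits[client]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_hits:
            return False
        hits.append(now)
        return True
```

Usage would be along the lines of `limiter.allow(request_headers)` for any request arriving from a translator IP, serving a 403 when it returns `False`.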
FYI, Babelfish from Yahoo is gamed the same way.
I have the same problem - it has grown to be the biggest scraper.
Question: how do you guys block it if it doesn't respect robots.txt? And shouldn't we complain to Google about such behavior?
|how do you guys block it if it doesn't respect robots.txt? |
From my experience a spider can be directed to access a folder or a file regardless of robots.txt (e.g., when someone redirects it from one server to another). Based on this I do not rely on robots.txt. In fact I leave the file blank, as it can confuse more than help, and use other methods to identify a visitor (like forward/backward IP-PTR resolution, port scans, HTTP headers, etc.). Unfortunately, some of these methods take a long time (even if you do it once per IP) and I haven't figured out a way to do it efficiently in real time.
For AppEngine-Google in particular you could use the IP-PTR-IP method and get rid of the scrapers/spammers. That doesn't do much, though, as they can hit you from other IPs.
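For anyone unfamiliar with the IP-PTR-IP check, a minimal sketch follows: look up the PTR name for the IP, make sure it sits under a domain you trust, then resolve that name forward and confirm it returns the original IP. The function name and its parameters are invented for illustration, the default lookups handle IPv4 only, and the suffix list is something you would supply yourself.

```python
import socket


def forward_confirmed_rdns(ip, allowed_suffixes,
                           reverse_lookup=None, forward_lookup=None):
    """Return True only if ip -> PTR name -> ip round-trips and the PTR
    name falls under one of `allowed_suffixes`.

    The two lookup parameters are hypothetical hooks so the DNS calls can
    be swapped out (e.g. for a cache); by default they use the socket module.
    """
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])

    try:
        host = reverse_lookup(ip).rstrip(".").lower()
    except OSError:  # no PTR record, or the lookup failed
        return False

    suffixes = [s.strip(".").lower() for s in allowed_suffixes]
    if not any(host == s or host.endswith("." + s) for s in suffixes):
        return False

    try:
        # The PTR name must resolve back to the IP we started with;
        # otherwise anyone controlling their own reverse zone could fake it.
        return ip in forward_lookup(host)
    except OSError:
        return False
```

A call might look like `forward_confirmed_rdns(remote_ip, ["googlebot.com"])`, where the suffix is just an example. Since each check costs two DNS lookups, storing the result per IP (a dict is enough) keeps the delay to the first request from each address.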