This one lets people run their web apps using Google's infrastructure. According to the YouTube video this is an official Google project. To be honest, this reeks of log spam. It hit the default root page on every one of my websites. No robots.txt.
Has anyone else seen this yet?
Date: 08/May/2009 08:47:08
IP: 64.233.172.6
UA: AppEngine-Google; (+http://code.google.com/appengine)
Referer: [(subdomain).appspot.com...]
The subdomain was for an Apache log tool.
April, 2008: 64.233.172.2
AppEngine-Google; (+http://code.google.com/appengine)
June, 2008: 66.249.84.15
AppEngine-Google; (+http://code.google.com/appengine)
July, 2008: 74.125.16.37
AppEngine-Google; (+http://code.google.com/appengine)
I sent it packing way back when. A quick skim of its then-activity:
Hits all file types, including favicons and .js files, but referers switch too rapidly -- e.g., in two seconds -- to be real people in real time. Where referers exist, they were all Google-related/hosted:
translate.google.com
downforeveryoneorjustme.com
fetchserver1.appspot.com
robots.txt? NO
*For more info, search this site for: AppEngine-Google
: )
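For anyone who wants to apply the "referers switch too rapidly" test to their own logs, here is a rough sketch. It assumes you have already parsed your access log into (unix_time, ip, referer) tuples; the function name, the 2-second window and the threshold of 3 distinct referers are all illustrative choices, not anything taken from the bot's actual behaviour.

```python
from collections import defaultdict

def rapid_referer_ips(hits, window_seconds=2.0, min_distinct=3):
    """Flag IPs that present several different referers within a short window.

    hits: iterable of (unix_time, ip, referer) tuples (hypothetical, pre-parsed).
    Returns the set of IPs that showed at least `min_distinct` different
    referers inside any span of `window_seconds`.
    """
    by_ip = defaultdict(list)
    for ts, ip, referer in hits:
        # Ignore requests with no referer ("-" is the usual log placeholder).
        if referer and referer != "-":
            by_ip[ip].append((ts, referer))

    flagged = set()
    for ip, entries in by_ip.items():
        entries.sort()
        start = 0
        for end in range(len(entries)):
            # Slide the window start forward until it spans <= window_seconds.
            while entries[end][0] - entries[start][0] > window_seconds:
                start += 1
            distinct = {ref for _, ref in entries[start:end + 1]}
            if len(distinct) >= min_distinct:
                flagged.add(ip)
                break
    return flagged
```

A real visitor rarely arrives from three different sites within two seconds, so anything this flags is worth a closer look before blocking.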
ff-in-f67.google.com
Mozilla/5.0 (SnapPreviewBot) Gecko/20061206 Firefox/1.5.0.9,gzip(gfe) (via translate.google.com)
Also: A Script is mirroring my clients site with google translate? [webmasterworld.com]
Another service I've seen used in the same way is the W3C validator, so I had to block that too.
So far they are easy to block, as these IPs do not resolve properly via rDNS.
On the translator/webapp topic -- I have a love-hate thing going on. I see legit people using Google's and others' translators, but the risks are real. Plus I deny all non-html files to Google's IPs so the pages that translation and/or any webapp users see would be really wacky.
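A minimal sketch of the "deny all non-html files" rule, in case it helps. The addresses are just the single IPs reported earlier in this thread, used as sample data and not as a real list of Google ranges; should_serve and HTML_EXTENSIONS are made-up names for illustration.

```python
import ipaddress
from posixpath import splitext

# Sample data only: single addresses reported in this thread, not real ranges.
RESTRICTED = [
    ipaddress.ip_network(net)
    for net in ("64.233.172.6/32", "66.249.84.15/32", "74.125.16.37/32")
]

# Paths with no extension (e.g. "/") are treated as HTML pages here.
HTML_EXTENSIONS = {"", ".html", ".htm"}

def should_serve(ip, path, restricted=RESTRICTED):
    """Return False for non-HTML requests coming from a restricted network."""
    addr = ipaddress.ip_address(ip)
    if not any(addr in net for net in restricted):
        return True  # not a restricted address: serve everything
    # Drop any query string before looking at the file extension.
    ext = splitext(path.split("?", 1)[0])[1].lower()
    return ext in HTML_EXTENSIONS
```

With this in place the proxy still gets the page text but none of the images, scripts or stylesheets, which is exactly why the translated pages end up looking wacky.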
The website it hit belongs to an author whose site I host. It did a GET on the default root page only.
The referrer is from a site with a bazillion links on the page, none of which point to the above website. Instead, each of the links points to a different sub-domain on the parent domain, each of which also has a bazillion links on it. None of the linked sites appears to serve any information of value. They're just full of links to other sites on sub-domains of the parent domain. Is this what's called a link farm? I always thought those linked to sites on other domains, not to sub-domains like these. But I'm hardly an expert on this kind of thing.
So for now I'm just watching it to see if it returns, and if so what it does and what additional info it gives me.
Maybe an iPod? It tried to grab media directly from outside my webpage. If I'm not letting Google Images do this, or anyone else via my leech settings, then what is this dork doing?
After seeing many iffy UAs and at least one bot visit via translate.google.com (or the gadget/toolbar)
Most people blindly allow anything coming from Google IPs.
I only allow Google IPs that are actually the Googlebot, or other plainly identified crawler from Google that I allow.
Everything else is treated like a data center and blocked simply because of the rampant abuses.
If you want to allow Google's translator, you can filter it more narrowly by using the FORWARD proxy field and tracking access per IP being forwarded via Google. That gives you a little more room to play than just blocking it altogether.
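Here is a rough sketch of that per-forwarded-IP tracking. It assumes the "FORWARD proxy field" means the X-Forwarded-For request header and that the first address in it is the end user; the class name and the limits are invented for illustration. The header is trivially spoofable, so only trust it on requests that actually arrive from the proxy's own IPs.

```python
import time
from collections import defaultdict, deque

class ForwardedClientLimiter:
    """Sliding-window hit counter keyed on the forwarded client address."""

    def __init__(self, max_hits=30, window_seconds=60.0):
        self.max_hits = max_hits
        self.window = window_seconds
        self.hits = defaultdict(deque)  # client address -> recent hit times

    @staticmethod
    def forwarded_client(header_value):
        """Return the first address in an X-Forwarded-For value, or None."""
        if not header_value:
            return None
        first = header_value.split(",")[0].strip()
        return first or None

    def allow(self, header_value, now=None):
        client = self.forwarded_client(header_value)
        if client is None:
            # Proxied request that names no end user: treat as unidentified.
            return False
        now = time.monotonic() if now is None else now
        recent = self.hits[client]
        # Forget hits that have aged out of the window.
        while recent and now - recent[0] > self.window:
            recent.popleft()
        if len(recent) >= self.max_hits:
            return False
        recent.append(now)
        return True
```

Call allow() with the header value for each request coming through the translator; a person reading a few translated pages stays under the limit, while a script mirroring the whole site through the proxy gets cut off.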
FYI, Babelfish from Yahoo is gamed the same way.
How do you guys block it if it doesn't respect robots.txt?
For the AppEngine-Google one in particular, you could use the ip-ptr-ip method and get rid of the scrapers/spammers. That doesn't do much, though, since they can hit you from other IPs.
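For reference, the ip-ptr-ip check (look up the PTR name for the IP, then resolve that name forward and confirm it returns the same IP) can be sketched like this. The function name and the default ".googlebot.com" suffix are assumptions for the example; adjust the suffix list to whichever crawlers you actually allow. The reverse/forward parameters exist only so the lookups can be swapped out for testing.

```python
import socket

def fcrdns_ok(ip, allowed_suffixes=(".googlebot.com",),
              reverse=socket.gethostbyaddr, forward=socket.gethostbyname_ex):
    """Return True if ip -> PTR name -> ip round-trips under an allowed suffix."""
    try:
        hostname = reverse(ip)[0]  # step 1: IP -> PTR hostname
    except OSError:
        return False  # no usable PTR record
    hostname = hostname.lower().rstrip(".")
    if not hostname.endswith(tuple(allowed_suffixes)):
        return False  # PTR exists but is not a name we allow
    try:
        addresses = forward(hostname)[2]  # step 2: hostname -> list of IPs
    except OSError:
        return False
    # Step 3: the original IP must be among the forward results.
    return ip in addresses
```

DNS lookups are slow, so cache the result per IP instead of running this on every request.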