Home / Forums Index / Search Engines / Search Engine Spider and User Agent Identification
Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

AppEngine-Google: New Google UA to watch for

 8:24 pm on May 10, 2009 (gmt 0)

AppEngine-Google; ( [code.google.com...]
OrgName: Google Inc.
Address: 1600 Amphitheatre Parkway
City: Mountain View
StateProv: CA
PostalCode: 94043
Country: US

This one lets people run their web apps using Google's infrastructure. According to the YouTube video this is an official Google project. To be honest, this reeks of log spam. It hit the default root page on every one of my websites. No robots.txt.

Has anyone else seen this yet?
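For anyone wanting to catch this UA at the application level rather than in the server config, a minimal sketch (Python; the function name is mine, and the pattern only anchors the prefix because some variants reportedly append extra tokens such as an app id):

```python
import re

# Matches the UA string quoted above; only the prefix is anchored
# (an assumption, to tolerate appended tokens like "appid: ...").
APPENGINE_UA = re.compile(r"^AppEngine-Google; \(\+http://code\.google\.com/appengine")

def is_appengine_fetch(user_agent):
    """True when the request's User-Agent matches AppEngine's fetch UA."""
    return bool(APPENGINE_UA.match(user_agent or ""))
```

From there you can 403 the request, log it, or whatever your trap list already does.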



 10:08 pm on May 10, 2009 (gmt 0)

Once this month. It got fed a 403 from a previously trapped unwelcome UA on the same IP and went sulking. I may or may not release the IP but the bot is already in the trap list.

Date: 08/May/2009 08:47:08
UA: AppEngine-Google; (+http://code.google.com/appengine)
Referer: [(subdomain).appspot.com...]

Subdomain was for an apache log tool.


 8:35 am on May 11, 2009 (gmt 0)

It's been around for a while.* From my notes:

April, 2008:
AppEngine-Google; (+http://code.google.com/appengine)

June, 2008:
AppEngine-Google; (+http://code.google.com/appengine)

July, 2008:
AppEngine-Google; (+http://code.google.com/appengine)

I sent it packing way back when. A quick skim of its then-activity:

Hits all file types, including favicons and .js files, but referers switch too rapidly -- e.g., in two seconds -- to be real people in real time. Where referers exist, they were all Google-related/hosted:


robots.txt? NO

*For more info, search this site for: AppEngine-Google

: )


 9:14 am on May 11, 2009 (gmt 0)

I've been seeing it for a while also. Not a bot, so no robots.txt. I've allowed it (as with all other Google UAs) on a whitelist of authenticated IPs.


 5:04 pm on May 11, 2009 (gmt 0)

After seeing many iffy UAs and at least one bot visit via translate.google.com (or the gadget/toolbar [translate.google.com.tr]), I decided to err on the restrictive side with its kin. Last month:

Mozilla/5.0 (SnapPreviewBot) Gecko/20061206 Firefox/,gzip(gfe) (via translate.google.com)

Also: A Script is mirroring my clients site with google translate? [webmasterworld.com]


 1:31 pm on May 12, 2009 (gmt 0)

They are using the translators as convenient scraping services. They can translate back and forth and automatically generate "new" content.

Another service I've seen them using the same way is the W3C validator, so I had to block that too.

So far they are easy to block, as these IPs do not rDNS properly.


 5:03 pm on May 12, 2009 (gmt 0)

Pfui, that thread you linked to was a real eye-opener for me.

You all are a never-ending font of knowledge for me. I'm most appreciative. Thanks.


 7:08 pm on May 12, 2009 (gmt 0)

Gary, thanks back atcha! I constantly learn from you and everyone here. Now if only I knew as much as y'all combined:)

On the translator/webapp topic -- I have a love-hate thing going on. I see legit people using Google's and others' translators, but the risks are real. Plus I deny all non-html files to Google's IPs so the pages that translation and/or any webapp users see would be really wacky.


 11:57 am on May 13, 2009 (gmt 0)

So am I correct in understanding that Google App Engine is comparable to Amazon's cloud computing, in that you never know who's really behind the IP address?


 6:17 am on May 14, 2009 (gmt 0)

Based on what I've seen I'm not sure how to treat it.

The website it hit belongs to an author whose site I host. It did a GET on the default root page only.

The referrer is from a site with a bazillion links on the page. None of which point to the above website. Instead, each of the links points to a different sub-domain on the parent domain, each of which also has a bazillion links on it. None of the linked sites appears to serve any information of value. They're just full of links to other sites on sub-domains of the parent domain. Is this what's called a link farm? I always thought those linked to sites on other domains, not sub-domains like these are. But I'm hardly an expert on this kind of thing.

So for now I'm just watching it to see if it returns, and if so what it does and what additional info it gives me.


 2:32 pm on May 17, 2009 (gmt 0)

I just found something seemingly stupid coming from the Google IP range: 74.125.75.xx. The UA says it's the Google Wireless Transcoder plus AppleWebKit.

Maybe an iPod? It tried to grab media directly from outside my webpage. If I'm not letting Google Images do this, or anyone else via my leech settings, then what is this dork doing?


 10:42 pm on May 17, 2009 (gmt 0)

After seeing many iffy UAs and at least one bot visit via translate.google.com (or the gadget/toolbar)

Most people blindly allow anything coming from Google IPs.

I only allow Google IPs that are actually Googlebot, or another plainly identified Google crawler that I allow.

Everything else is treated like a data center and blocked simply because of the rampant abuses.

If you want to allow Google's translator, you can filter it more narrowly by using the X-Forwarded-For proxy header and tracking access per IP being forwarded via Google; that gives you a little more room to play than blocking it altogether.
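In case the mechanics aren't obvious: the translator acts as a proxy and passes the original client's address along in the X-Forwarded-For header. A rough sketch of pulling out the forwarded IP (the function is illustrative; the leftmost entry being the originating client is the de facto convention, not a guarantee, since the header is client-settable):

```python
def forwarded_client_ip(headers):
    """Return the leftmost X-Forwarded-For entry -- by convention the
    originating client -- or None when the header is absent."""
    xff = headers.get("X-Forwarded-For", "")
    first = xff.split(",")[0].strip()
    return first or None
```

You can then rate-limit or block per forwarded IP instead of punishing every user of the translator at once.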

FYI, Babelfish from Yahoo is gamed the same way.


 7:47 pm on Jun 2, 2009 (gmt 0)

I have the same problem - it has grown to be the biggest scraper.

Question: how do you guys block it if it doesn't respect robots.txt? And shouldn't we complain to Google about such behavior?


 8:29 am on Jun 4, 2009 (gmt 0)

how do you guys block it if it doesn't respect robots.txt?

From my experience a spider can be directed to access a folder or a file regardless of robots.txt (e.g. when someone redirects it from one server to another). Because of this I do not rely on robots.txt. In fact I leave the file blank, as it can confuse more than help, and use other methods to identify a visitor (forward/backward IP-PTR resolution, port scans, HTTP headers, etc.). Unfortunately, some of these methods take a long time (even if you only do it once per IP) and I haven't figured out a way to do it efficiently in real time.

For AppEngine-Google in particular you could use the IP-PTR-IP method and get rid of the scrapers/spammers. That doesn't do much, though; they can hit you from other IPs.
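On the "takes a long time" point: the lookups are slow but highly repetitive, so memoizing the result per IP recovers most of the cost. A sketch under my own assumptions (the factory shape is mine; plug in `socket.gethostbyaddr` or any other per-IP check):

```python
import functools

def make_cached_resolver(resolver, maxsize=4096):
    """Wrap a (possibly slow) per-IP lookup so each IP is resolved at
    most once per process; failures are cached as None."""
    @functools.lru_cache(maxsize=maxsize)
    def lookup(ip):
        try:
            return resolver(ip)
        except OSError:
            return None
    return lookup
```

For example, `ptr = make_cached_resolver(lambda ip: socket.gethostbyaddr(ip)[0])` gives a cached PTR lookup. A production version would also want a TTL so stale records eventually expire, which `lru_cache` does not provide on its own.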


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved