-- Search Engine Spider and User Agent Identification
---- Google App Engine
incrediBILL - 5:35 am on Jul 15, 2012 (gmt 0)
adding a few new ones each week it seems
See, that's why whitelisting user agents beats blocking.
While your list gets longer and longer, mine stays the same length ;)
Not to mention that they don't get access the first time, which means no infringement has happened yet. That may not be the case if you're blocking them after the fact.
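To make the whitelist-versus-blocklist point concrete, here is a minimal Python sketch. The patterns and the `is_allowed` helper are hypothetical and chosen purely for illustration, not a recommended production list. Anything that doesn't match a known-good pattern is refused on its very first request, so the list never has to grow as new scrapers appear.

```python
import re

# Hypothetical allowlist (assumption): a real one would be tuned to your own traffic.
ALLOWED_PATTERNS = [re.compile(p, re.IGNORECASE) for p in (
    r"^Mozilla/\d\.\d \(",  # mainstream browsers and the major crawlers
    r"^Opera/\d",           # older Opera releases
)]

def is_allowed(user_agent):
    """Return True only if the user agent matches an allowlisted pattern."""
    if not user_agent:
        return False  # blank user agents are refused outright
    return any(p.search(user_agent) for p in ALLOWED_PATTERNS)

# A brand-new scraper is denied without anyone having to add it to a list.
print(is_allowed("Mozilla/5.0 (Windows NT 6.1; rv:13.0) Gecko/20100101 Firefox/13.0"))
print(is_allowed("NewScraper/1.0"))
```

A user agent match alone proves nothing about who is really behind the request, so anything claiming to be a search engine still needs to be verified separately.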
Anywho, I don't know much about AppEngine, but note that most of those apps have "proxy" attached. What a clever way to bypass the many absurdly simplistic bot blocks out there that simply whitelist the Google IP ranges: make a proxy that operates from within those ranges.
The cleverest use of this technique I've ever seen was a few years ago, before Google tightened up some of their security. Someone set their user agent to Googlebot and then used the Google Translator to scrape pages, which worked on many sites, just not on any site using full trip DNS verification for Googlebot (reverse lookup on the IP, then a forward lookup on the resulting hostname to confirm it points back to the same IP), which the Google translator obviously failed.
What cracks me up is those who whine that full trip DNS verification is too slow. Some blindly do RDNS on everything coming to their server, which is insane and of course slows everything down. However, if you validate only bots and cache the DNS results for a day, it's only slow once per IP verified, which, out of the many pages crawled daily, is quite acceptable. Since you're only doing full trip verification for bots, it has zero impact on regular visitors.
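As a rough illustration of the bots-only, cached approach described above, here is a Python sketch using only the standard library. The function names, the in-memory dict cache, and the `.googlebot.com` suffix are assumptions made for the example; check the search engine's own documentation for the hostnames it actually uses. `socket.gethostbyname_ex` is IPv4 only, so IPv6 would need a different forward lookup.

```python
import socket
import time

CACHE_TTL = 24 * 60 * 60              # cache results for a day, as suggested above
BOT_HOST_SUFFIXES = (".googlebot.com",)  # assumption: verify against current docs
_cache = {}                           # ip -> (verified, expiry timestamp)

def full_trip_verify(ip, suffixes=BOT_HOST_SUFFIXES,
                     reverse=socket.gethostbyaddr,
                     forward=socket.gethostbyname_ex):
    """Reverse-resolve the IP, check the hostname, then forward-resolve it back."""
    try:
        host = reverse(ip)[0].lower().rstrip(".")
        if not host.endswith(suffixes):
            return False              # rDNS is not in the crawler's domain
        return ip in forward(host)[2]  # hostname must map back to the same IP
    except OSError:                   # covers socket.herror and socket.gaierror
        return False

def verified_googlebot(ip, user_agent, now=time.time, verify=full_trip_verify):
    """True only when the request claims to be Googlebot AND passes the DNS check."""
    if "googlebot" not in (user_agent or "").lower():
        return False                  # regular visitor: no DNS lookup at all
    t = now()
    entry = _cache.get(ip)
    if entry and entry[1] > t:
        return entry[0]               # cached answer, no DNS cost
    result = verify(ip)
    _cache[ip] = (result, t + CACHE_TTL)
    return result
```

The resolvers are passed in as parameters only so the logic can be exercised without touching the network. In a multi-process server the dict would be swapped for a shared cache, but the shape stays the same: one slow lookup per claimed-bot IP per day, and nothing at all for ordinary visitors.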
The implementation is what makes or breaks the usefulness of the method and what separates real programmers from script kiddies :)