I hate cloaking. Alas, most majors and former majors cloak like crazy. Here's a bunch of bot-spotter minutiae, in no particular order...
-----
GOOGLE -----
The empire-builder uses bare IPs and no UAs to hit (& hit & hit) favicons for sites added to its Webmaster Tools pages (Home list, site Dashboards, etc.). For example --
72.14.192.68 - - [03/Sep/2010:10:11:39 -0700] "GET /favicon.ico HTTP/1.1" 200 5430 "-" "-"
IPs used the exact same way:
72.14.192.68
72.14.212.81
72.14.212.82
72.14.212.85
72.14.212.87
74.125.154.81
74.125.154.85
All bare IPs, no UAs, no robots.txt, no REF, no nothing. (sighs) And don't get me started on G's Code and Labs creations, a la:
74.125.154.85
AppEngine-Google; (+http://code.google.com/appengine; appid: linksalpha)
robots.txt? NO
2010 [webmasterworld.com...]
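For anyone who wants to pull these hits out of their own logs, here's a minimal sketch, assuming Apache "combined" log format like the favicon line earlier in this section. The regex and the is_bare_hit name are my own illustration, not part of any tool; it treats "-" or an empty string as "no referrer" and "no UA".

```python
import re

# Assumed layout: Apache "combined" format, matching the sample lines in this post.
LOG_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

def is_bare_hit(line):
    """Return True if the log line carries neither a referrer nor a user agent."""
    m = LOG_RE.match(line.strip())
    if m is None:
        return False  # not combined format; skip rather than guess
    no_ref = m.group("referer") in ("", "-")
    no_ua = m.group("agent") in ("", "-")
    return no_ref and no_ua

# The favicon hit quoted above.
sample = ('72.14.192.68 - - [03/Sep/2010:10:11:39 -0700] '
          '"GET /favicon.ico HTTP/1.1" 200 5430 "-" "-"')
print(is_bare_hit(sample))
```

Feed it each line of an access log and collect the hosts where it returns True. What you do with that list is up to you.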
-----
IBM -----
Like the Energizer Bunny, .watson.ibm.com just keeps going, and going, and going... For what purpose? Beats me.
2010 [webmasterworld.com...] [webmasterworld.com...]
2009 [webmasterworld.com...]
-----
MSN -----
This thread and its predecessor [webmasterworld.com...] aren't the only reports of MSN's cloaking:
2009 [webmasterworld.com...]
And here's a little oddity from last January: Microsoft's portal domain came a'crawlin':
gig4-2.tuk2f-gsr-a.us.msn.net
Microsoft MSN SocialStreams Bot
robots.txt? Yes
Hmm. I guess "Microsoft MSN SocialStreams Bot" is "Microsoft Bing Mobile SocialStreams Bot" now?
2010 [webmasterworld.com...]
-----
DOW JONES -----
Multiple IPs/server farms... Again, for what purpose? Dunno.
2009-2010 [webmasterworld.com...]
-----
YAHOO -----
Too many years, too many probs. Just last month:
research-mm10.corp.sp1.yahoo.com
Firefox 4.0
robots.txt? NO
Just today, no UA:
ycar3.mobile.sp1.yahoo.com - - [02/Sep/2010:07:47:55 -0700] "GET / HTTP/1.1" 403 702 "-" "-"
Oh, and HEAD requests, too, which are new and atypical for Slurp on my sites. And redundant: this file didn't change in 30 seconds:
llf531077.crawl.yahoo.net - - [21/Jul/2010:17:08:38 -0700] "HEAD /dirA/filenameA.html [snip]"
llf531077.crawl.yahoo.net - - [21/Jul/2010:17:09:08 -0700] "HEAD /dirA/filenameA.html [snip]"
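A rough sketch for spotting that kind of repeat HEAD in a log, assuming lines arrive in chronological order and start the way the two above do. The repeated_heads name and the 60-second window are my own arbitrary choices for illustration, so tune the window to taste.

```python
import re
from datetime import datetime, timedelta

# Only the start of the line is needed: host, timestamp, method, path.
HEAD_RE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+)')

def repeated_heads(lines, window=timedelta(seconds=60)):
    """List (host, path, gap_seconds) for HEADs repeated within `window`."""
    last_seen = {}   # (host, path) -> time of the previous HEAD
    repeats = []
    for line in lines:
        m = HEAD_RE.match(line)
        if m is None or m.group(3) != "HEAD":
            continue
        host, stamp, _method, path = m.groups()
        when = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z")
        prev = last_seen.get((host, path))
        if prev is not None and when - prev <= window:
            repeats.append((host, path, (when - prev).total_seconds()))
        last_seen[(host, path)] = when
    return repeats

# The two Slurp lines quoted above, snips and all.
sample_lines = [
    'llf531077.crawl.yahoo.net - - [21/Jul/2010:17:08:38 -0700] "HEAD /dirA/filenameA.html [snip]"',
    'llf531077.crawl.yahoo.net - - [21/Jul/2010:17:09:08 -0700] "HEAD /dirA/filenameA.html [snip]"',
]
print(repeated_heads(sample_lines))
```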
'dev' subdomains from .corp.yahoo.com have historically been problematic, too:
18ndev96.yst.corp.yahoo.com
sedev1039.yst.corp.yahoo.com
Re both of those:
Mozilla/5.0 (compatible; Yahoo! Slurp; [help.yahoo.com...]
robots.txt? NO
But wait! There's more!
2009-2010 [webmasterworld.com...]
-----
BAIDU -----
Here's a thread from 2009 and they're still at it. 'Nuff said: [webmasterworld.com...]
-----
Okay, that's enough. But that's not all... All of the above are but a minuscule fraction of hits from cloaked start-ups, wanna-bes, Twitter swarmers, student projects, semi-clueless individuals and denizens of the cesspool that is AmazonAWS [webmasterworld.com...]
Solution? I err on the restrictive side when it comes to anyone wasting my and/or my clients' bandwidth, even more so when it comes to crawling for unknown reasons. No read/heed robots.txt? 403
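One way to build the candidate list for that rule, as a sketch only: scan a log and report every host that requested something but never asked for /robots.txt. The hosts_skipping_robots name and the example hosts below are made up for illustration. It only covers the "read" half; whether a bot heeds what it read needs a comparison against your Disallow lines, which this doesn't attempt.

```python
import re

# Host and requested path from the start of a combined-format line.
REQ_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "\S+ (\S+)')

def hosts_skipping_robots(lines):
    """Return hosts seen in the log that never requested /robots.txt."""
    seen = set()
    fetched_robots = set()
    for line in lines:
        m = REQ_RE.match(line)
        if m is None:
            continue
        host, path = m.groups()
        seen.add(host)
        # Ignore any query string when checking for robots.txt.
        if path.split("?", 1)[0] == "/robots.txt":
            fetched_robots.add(host)
    return seen - fetched_robots

# Hypothetical log lines, invented for this example (not from my logs).
example_lines = [
    'crawler-a.example - - [01/Jan/2010:00:00:00 -0700] "GET /robots.txt HTTP/1.1" 200 120 "-" "ExampleBot"',
    'crawler-a.example - - [01/Jan/2010:00:00:05 -0700] "GET /page.html HTTP/1.1" 200 900 "-" "ExampleBot"',
    'crawler-b.example - - [01/Jan/2010:00:01:00 -0700] "GET /page.html HTTP/1.1" 200 900 "-" "-"',
]
print(hosts_skipping_robots(example_lines))
```

Anything it returns is a candidate for the 403 list, not an automatic verdict: a host that read robots.txt before the log window started will show up here too.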
Thoughts?