Forum Moderators: open
- robots.txt? NO
- Uneven apostrophes in UA (only closing)
- site in UA yields this oh-so-descriptive info:
<html>
<head>
</head>
<body>
</body>
</html>
----- ----- ----- ----- -----
FWIW, bona fide amazonaws.com hosts spewed at least 33 bots on two of my sites in recent months. (Does someone get paid per bot or something?) Some bots may be new to some of you; or newly renamed. Here are the actual UA strings; in no particular order:
NetSeer/Nutch-0.9 (NetSeer Crawler; [netseer.com;...] crawler@netseer.com)
robots.txt? YES
Mozilla/5.0 (Windows; U; Windows NT 5.1; ru; rv:1.8.0.6) Gecko/20060728 Firefox/1.5.0.6
[Note ru.]
robots.txt? NO
feedfinder/1.371 Python-urllib/1.16 +http://www.aaronsw.com/2002/feedfinder/
robots.txt? NO
Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9b4pre) Gecko/2008022910 Viewzi/0.1
robots.txt? NO
Twitturly / v0.5
robots.txt? NO
YebolBot (compatible; Mozilla/5.0; MSIE 7.0; Windows NT 6.0; rv:1.8.1.11; mailTo:thunder.chang@gmail.com)
robots.txt? NO
YebolBot (Email: yebolbot@gmail.com; If the web crawling affects your web service, or you don't like to be crawled by us, please email us. We'll stop crawling immediately.)
[Whattaya think robots.txt is for, huh?]
robots.txt? YES ... Four times in 45 minutes
Attributor/Dejan-1.0-dev (Test crawler; [attributor.com;...] info at attributor com)
robots.txt? NO
PRCrawler/Nutch-0.9 (data mining development project)
robots.txt? YES
EnaBot/1.2 (http://www.enaball.com/crawler.html)
robots.txt? YES
Nokia6680/1.0 ((4.04.07) SymbianOS/8.0 Series60/2.6 Profile/MIDP-2.0 Configuration/CLDC-1.1 (botmobi find.mobi/bot.html) )
[Note spaced-out closing parens]
robots.txt? YES
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; T312461) Java/1.5.0_09
robots.txt? NO
TheRarestParser/0.2a (http://therarestwords.com/)
robots.txt? NO
Mozilla/5.0 (compatible; D1GArabicEngine/1.0; crawlmaster@d1g.com)
robots.txt? NO
Clustera Crawler/Nutch-1.0-dev (Clustera Crawler; [crawler.clustera.com;...] cluster@clustera.com)
robots.txt? YES
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.7) Gecko/20060909 Firefox/1.5.0.7
robots.txt? YES
yacybot (i386 Linux 2.6.16-xenU; java 1.6.0_02; America/en) [yacy.net...]
robots.txt? NO
Mozilla/5.0
robots.txt? NO
Spock Crawler (http://www.spock.com/crawler)
robots.txt? YES
TinEye
robots.txt? NO
Teemer (NetSeer, Inc. is a Los Angeles based Internet startup company.; [netseer.com...] crawler@netseer.com)
robots.txt? YES
nnn/ttt (n)
robots.txt? YES
AideRSS/1.0 (aiderss.com)
robots.txt? NO
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)
robots.txt? NO
----- ----- ----- ----- -----
These two UAs alternated multiple times one afternoon:
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)
robots.txt? NO
WebClient
robots.txt? YES
----- ----- ----- ----- -----
And finally, way too many offerings from "Paul," who's apparently unable to make up his mind, UA name-wise:
Mozilla/5.0 (compatible; page-store) [email:paul at page-store.com
robots.txt? NO
Mozilla/5.0 (compatible; heritrix/1.12.1 +http://www.page-store.com)
robots.txt? YES
Mozilla/5.0 (compatible; heritrix/1.12.1 +http://www.page-store.com) [email:paul@page-store.com]
robots.txt? YES
Mozilla/5.0 (compatible; zermelo; +http://www.powerset.com) [email:paul@page-store.com,crawl@powerset.com]
robots.txt? NO
zermelo Mozilla/5.0 compatible; heritrix/1.12.1 (+http://www.powerset.com) [email:crawl@powerset.com,email:paul@page-store.com]
robots.txt? YES
zermelo Mozilla/5.0 compatible; heritrix/1.12.1 (+http://www.powerset.com) [email:crawl@powerset.com,email:paul@page-store.com]
robots.txt? YES
Mozilla/5.0 (compatible; zermelo; +http://www.powerset.com) [email:paul@page-store.com,crawl@powerset.com]
robots.txt? NO
-----
Slippery little suckers indeed. Thank goodness I block amazonaws.com no matter what.
We've determined that an Amazon EC2 instance was running at the IP address you provided in your abuse report. However, the instance has since been terminated. You shouldn't see any further abusive activity from the IP address you submitted.
Thanks again for alerting us to this issue
If, once your site has a modified robots.txt file in place, web crawls continue from EC2 address space, please file another abuse form. Failure to follow robots.txt is a violation of the EC2 Terms of Use, and will be handled as Abuse Instance (or abuse report).
Dear Amazon EC2 customers,
We are pleased to announce that as part of our ongoing expansion, we have added a new public IP range. The current Amazon EC2 public address ranges are:
US East (Northern Virginia):
216.182.224.0/20 (216.182.224.0 - 216.182.239.255)
72.44.32.0/19 (72.44.32.0 - 72.44.63.255)
67.202.0.0/18 (67.202.0.0 - 67.202.63.255)
75.101.128.0/17 (75.101.128.0 - 75.101.255.255)
174.129.0.0/16 (174.129.0.0 - 174.129.255.255)
204.236.192.0/18 (204.236.192.0 - 204.236.255.255) [previously 204.236.224.0/19]
US West (Northern California):
204.236.128.0/18 (216.236.128.0 - 216.236.191.255)
EU (Ireland):
79.125.0.0/17 (79.125.0.0 - 79.125.127.255)
Sincerely,
The Amazon EC2 Team
ec2-184-73-234-197.compute-1.amazonaws.com - - [05/Aug/2010:06:46:52 -0700] "GET / HTTP/1.1" 403 1468 "-" "" ec2-184-73-234-197.compute-1.amazonaws.com - - [05/Aug/2010:06:51:37 -0700] "GET /dir HTTP/1.1" 403 1468 "-" ## BLOCK *Faked* blank referer -OR- UA (malicious agents supply a literal hyphen as UA string)
RewriteCond %{HTTP_REFERER}<->%{HTTP_USER_AGENT} ^<->$
03/Aug/2010:12:58:31 -0700] "GET / HTTP/1.1" 403 1468 "-" ""
[04/Aug/2010:04:10:08 -0700] "GET / HTTP/1.1" 403 1468 "-" ""
[04/Aug/2010:06:46:52 -0700] "GET / HTTP/1.1" 403 1468 "-" ""
[04/Aug/2010:06:51:37 -0700] "GET /dir HTTP/1.1" 403 1468 "-" ""
[04/Aug/2010:10:24:00 -0700] "GET /fileA.html HTTP/1.1" 403 1468 "-" ""
[04/Aug/2010:12:12:16 -0700] "GET /fileB.html HTTP/1.1" 403 1468 "-" ""
[05/Aug/2010:12:18:08 -0700] "GET /FileB.html HTTP/1.1" 403 1468 "-" "bitmagicbot/0.1 admin@bitmagic.in"