
Forum Moderators: Ocean10000 & incrediBILL & keyplyr


amazonaws.com plays host to a wide variety of bad bots

Most recently seen: Gnomit

     
3:04 am on Jan 18, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 5, 2005
posts:2038
votes: 1


ec2-67-202-57-30.compute-1.amazonaws.com
Mozilla/5.0 (compatible; X11; U; Linux i686 (x86_64); en-US; +http://gnomit.com/) Gecko/2008092416 Gnomit/1.0"

- robots.txt? NO
- Uneven quotation marks in UA (only closing)
- site in UA yields this oh-so-descriptive info:

<html>
<head>
</head>
<body>
</body>
</html>

----- ----- ----- ----- -----
FWIW, bona fide amazonaws.com hosts spewed at least 33 bots on two of my sites in recent months. (Does someone get paid per bot or something?) Some bots may be new to some of you; or newly renamed. Here are the actual UA strings; in no particular order:

NetSeer/Nutch-0.9 (NetSeer Crawler; [netseer.com;...] crawler@netseer.com)
robots.txt? YES

Mozilla/5.0 (Windows; U; Windows NT 5.1; ru; rv:1.8.0.6) Gecko/20060728 Firefox/1.5.0.6
[Note ru.]
robots.txt? NO

feedfinder/1.371 Python-urllib/1.16 +http://www.aaronsw.com/2002/feedfinder/
robots.txt? NO

Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9b4pre) Gecko/2008022910 Viewzi/0.1
robots.txt? NO

Twitturly / v0.5
robots.txt? NO

YebolBot (compatible; Mozilla/5.0; MSIE 7.0; Windows NT 6.0; rv:1.8.1.11; mailTo:thunder.chang@gmail.com)
robots.txt? NO

YebolBot (Email: yebolbot@gmail.com; If the web crawling affects your web service, or you don't like to be crawled by us, please email us. We'll stop crawling immediately.)
[Whattaya think robots.txt is for, huh?]
robots.txt? YES ... Four times in 45 minutes

Attributor/Dejan-1.0-dev (Test crawler; [attributor.com;...] info at attributor com)
robots.txt? NO

PRCrawler/Nutch-0.9 (data mining development project)
robots.txt? YES

EnaBot/1.2 (http://www.enaball.com/crawler.html)
robots.txt? YES

Nokia6680/1.0 ((4.04.07) SymbianOS/8.0 Series60/2.6 Profile/MIDP-2.0 Configuration/CLDC-1.1 (botmobi find.mobi/bot.html) )
[Note spaced-out closing parens]
robots.txt? YES

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; T312461) Java/1.5.0_09
robots.txt? NO

TheRarestParser/0.2a (http://therarestwords.com/)
robots.txt? NO

Mozilla/5.0 (compatible; D1GArabicEngine/1.0; crawlmaster@d1g.com)
robots.txt? NO

Clustera Crawler/Nutch-1.0-dev (Clustera Crawler; [crawler.clustera.com;...] cluster@clustera.com)
robots.txt? YES

Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.7) Gecko/20060909 Firefox/1.5.0.7
robots.txt? YES

yacybot (i386 Linux 2.6.16-xenU; java 1.6.0_02; America/en) [yacy.net...]
robots.txt? NO

Mozilla/5.0
robots.txt? NO

Spock Crawler (http://www.spock.com/crawler)
robots.txt? YES

TinEye
robots.txt? NO

Teemer (NetSeer, Inc. is a Los Angeles based Internet startup company.; [netseer.com...] crawler@netseer.com)
robots.txt? YES

nnn/ttt (n)
robots.txt? YES

AideRSS/1.0 (aiderss.com)
robots.txt? NO

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)
robots.txt? NO

----- ----- ----- ----- -----
These two UAs alternated multiple times one afternoon:

Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)
robots.txt? NO

WebClient
robots.txt? YES

----- ----- ----- ----- -----
And finally, way too many offerings from "Paul," who's apparently unable to make up his mind, UA name-wise:

Mozilla/5.0 (compatible; page-store) [email:paul at page-store.com
robots.txt? NO

Mozilla/5.0 (compatible; heritrix/1.12.1 +http://www.page-store.com)
robots.txt? YES

Mozilla/5.0 (compatible; heritrix/1.12.1 +http://www.page-store.com) [email:paul@page-store.com]
robots.txt? YES

Mozilla/5.0 (compatible; zermelo; +http://www.powerset.com) [email:paul@page-store.com,crawl@powerset.com]
robots.txt? NO

zermelo Mozilla/5.0 compatible; heritrix/1.12.1 (+http://www.powerset.com) [email:crawl@powerset.com,email:paul@page-store.com]
robots.txt? YES

zermelo Mozilla/5.0 compatible; heritrix/1.12.1 (+http://www.powerset.com) [email:crawl@powerset.com,email:paul@page-store.com]
robots.txt? YES

Mozilla/5.0 (compatible; zermelo; +http://www.powerset.com) [email:paul@page-store.com,crawl@powerset.com]
robots.txt? NO

-----
Slippery little suckers indeed. Thank goodness I block amazonaws.com no matter what.
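
For anyone who logs hostnames rather than IPs, the EC2 hostnames quoted above carry the address in the name itself. Here is a minimal Python sketch that pulls it back out. The function name `ec2_ip_from_hostname` and the regex are my own assumptions based only on the hostname shape seen in this thread, not anything Amazon documents, and a reverse-DNS name on its own is only a hint.

```python
import ipaddress
import re

# Assumed pattern, based on hostnames logged in this thread, e.g.
# ec2-67-202-57-30.compute-1.amazonaws.com
EC2_HOST = re.compile(
    r"^ec2-(\d{1,3})-(\d{1,3})-(\d{1,3})-(\d{1,3})"
    r"\.(?:[a-z0-9-]+\.)+amazonaws\.com$"
)

def ec2_ip_from_hostname(hostname):
    """Return the IPv4 address embedded in an EC2-style hostname, or None."""
    match = EC2_HOST.match(hostname.strip().lower().rstrip("."))
    if match is None:
        return None
    try:
        return ipaddress.IPv4Address(".".join(match.groups()))
    except ipaddress.AddressValueError:
        return None  # not a valid dotted quad, e.g. an octet above 255

print(ec2_ip_from_hostname("ec2-67-202-57-30.compute-1.amazonaws.com"))
```

Anything that does not end in a real `.amazonaws.com` label, or whose octets are not a valid address, comes back as `None`.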

2:43 am on Nov 9, 2010 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 5, 2005
posts: 2038
votes: 1


NetcraftSurveyAgent has been around for a few years at least, originally hailing from lager.netcraft.com using the same UA and typically HEAD-requesting:

/icons/apache_pb.gif

That path's above the webspace but the file is accessible (and the URI not easily blocked, imho), because the file's an Apache OS image. I think it's really, really sneaky that Netcraft probes that way.

Also, Netcraft's been bot-running from AWS since, oh, early this year. Regardless of host, it never asks for robots.txt. Then again, the two robots.txt files they use (search their site for: robots.txt) are syntactically incorrect/ineffectual.

[edited by: Pfui at 2:45 am (utc) on Nov 9, 2010]

2:45 am on Nov 9, 2010 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 5, 2005
posts: 2038
votes: 1


AWS-based bot-runners are so clever. Not.

ec2-204-236-129-39.us-west-1.compute.amazonaws.com
lqqBithnar0 qlmqd yhyc

robots.txt? NO
11:14 pm on Nov 9, 2010 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 5, 2005
posts: 2038
votes: 1


Three seconds, three HEAD request hits, zero robots.txt:

ec2-184-73-159-0.compute-1.amazonaws.com
ec2-184-73-159-0.compute-1.amazonaws.com
ec2-184-72-160-50.compute-1.amazonaws.com

All using the site-hosted-on-Amazon (184.73.159.0), social news reader/story aggregator, currently-in-private-beta app:

Summify (Summify/1.0; +http://summify.com)
11:50 am on Nov 27, 2010 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 5, 2005
posts: 2038
votes: 1


Twitter-swarmer:

ec2-184-72-128-30.compute-1.amazonaws.com
Mozilla/5.0 (compatible; PaperLiBot/2.1)

robots.txt? NO
2:01 pm on Dec 1, 2010 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 5, 2005
posts: 2038
votes: 1


A Chinese bot by any other name...

ec2-184-72-5-61.us-west-1.compute.amazonaws.com
vik-robot/Nutch-1.0 (vikspider; http://vik.com; chenlibiti@163.com)

robots.txt? Yes, but after hitting root.

Previously (in this thread; mssg #4052607 by dstiles; 01-2010):
Chen Li/Nutch-1.0 (Nutch spiderman; http://chenli.com.cn; chenlibiti@163.com)
10:03 pm on Dec 24, 2010 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:May 14, 2008
posts:3148
votes: 4


New (to me) AWS range at RIPE:

46.137.0.0 - 46.137.127.255

First hit from it today pretending to be Moz/4 MSIE 7 on XP with bad headers.
8:27 pm on Dec 25, 2010 (gmt 0)

Junior Member from DE 

10+ Year Member

joined:June 25, 2005
posts:181
votes: 1


+ 46.137.128.0 - 46.137.191.255

Full Amazon range: 46.137/16
8:53 pm on Dec 27, 2010 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:May 14, 2008
posts:3148
votes: 4


How did I miss that one!

Actually, I'll now block the full /16 as 192/18 is Amazon Data Services Ireland. Must have had an off-day before. :)
1:46 am on Dec 28, 2010 (gmt 0)

Moderator This Forum from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:8900
votes: 402


Thanks for the Amazon Ireland info :)
10:44 pm on Feb 5, 2011 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:May 14, 2008
posts:3148
votes: 4


Another Irish Amazon range. This one not listed as AWS/Cloud but the same email domain.

87.238.80.0 - 87.238.87.255
4:37 pm on Feb 20, 2011 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Dec 27, 2004
posts:1889
votes: 56


Fresh scraper UA: Qryos
from:
inetnum: 122.248.224.0 - 122.248.255.255
netname: AMAZON-EC2-SG
descr: Amazon Web Services, Elastic Compute Cloud, EC2, SG
8:11 pm on Feb 20, 2011 (gmt 0)

Moderator This Forum from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:8900
votes: 402


122.248.224.0 - 122.248.255.255

DNS Stuff doesn't give the CIDR on this range, but I'm assuming it is:

122.248.224.0/19
10:32 pm on Feb 20, 2011 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:May 14, 2008
posts:3148
votes: 4


Thanks for the heads-up - first Asian AWS I've got! :)

The full range is actually 122.248.192.0 - 122.248.255.255
9:30 pm on Feb 22, 2011 (gmt 0)

Moderator This Forum from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:8900
votes: 402



The full range is actually 122.248.192.0 - 122.248.255.255


OK then: deny from 122.248.192.0/18
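
When Whois gives only a start and end address, the CIDR does not have to be guessed. A small Python sketch using the standard `ipaddress` module; the helper name `range_to_cidrs` is just for illustration.

```python
import ipaddress

def range_to_cidrs(first, last):
    """Return the minimal list of CIDR blocks covering first..last inclusive."""
    start = ipaddress.IPv4Address(first)
    end = ipaddress.IPv4Address(last)
    return [str(net) for net in ipaddress.summarize_address_range(start, end)]

# The partial range first reported, then the full range
print(range_to_cidrs("122.248.224.0", "122.248.255.255"))
print(range_to_cidrs("122.248.192.0", "122.248.255.255"))
```

The first call gives 122.248.224.0/19 and the second 122.248.192.0/18, matching the two figures quoted in the posts above. A range that does not sit on a clean boundary comes back as several blocks instead of one.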
4:26 pm on Mar 8, 2011 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:May 14, 2008
posts:3148
votes: 4


Another range, this time Singapore (compiled from 4 Whois records):

inetnum: 175.41.128.0 - 175.41.255.255
netname: AMAZON-AP-RESOURCES-SG
descr: Amazon Web Services, Elastic Compute Cloud, EC2, SG
remarks: The activity you have detected originates from a dynamic hosting environment.
2:18 am on Apr 23, 2011 (gmt 0)

Senior Member

joined:Dec 29, 2003
posts:5428
votes: 0


Ban them as amazonaws.com via htaccess and be done with it.
8:12 pm on Apr 23, 2011 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:May 14, 2008
posts:3148
votes: 4


On an old Windows 2003 system?

MSDOS was designed on the principle: take the best from Linux/Unix; take the best from CPM; throw all that away; now make it up as you go along. It's taken over 30 years for MS to add something even close to rewrite. Sadly, in my early web days I listened to a Microsoft-qualified guy and left Linux. I have too much ASP library code now to return to a Linux server.

I wouldn't be sure they would always show up as amazonaws.com anyway.
3:44 am on Apr 28, 2011 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:July 26, 2006
posts: 1619
votes: 0


Here are the most recent Amazon ranges:
Dear Amazon EC2 customer,

We are pleased to announce that as part of our ongoing expansion, we have added a new public IP range. The current Amazon EC2 public address ranges are:


US East (Northern Virginia):


216.182.224.0/20 (216.182.224.0 - 216.182.239.255)

72.44.32.0/19 (72.44.32.0 - 72.44.63.255)

67.202.0.0/18 (67.202.0.0 - 67.202.63.255)

75.101.128.0/17 (75.101.128.0 - 75.101.255.255)

174.129.0.0/16 (174.129.0.0 - 174.129.255.255)

204.236.192.0/18 (204.236.192.0 - 204.236.255.255)

184.73.0.0/16 (184.73.0.0 - 184.73.255.255)

184.72.128.0/17 (184.72.128.0 - 184.72.255.255)

184.72.64.0/18 (184.72.64.0 - 184.72.127.255)

50.16.0.0/15 (50.16.0.0 - 50.17.255.255)

50.19.0.0/16 (50.19.0.0 - 50.19.255.255)


US West (Northern California):


204.236.128.0/18 (204.236.128.0 - 204.236.191.255)

184.72.0.0/18 (184.72.0.0 - 184.72.63.255)

50.18.0.0/17 (50.18.0.0 - 50.18.127.255)


EU (Ireland):


79.125.0.0/17 (79.125.0.0 - 79.125.127.255)

46.51.128.0/18 (46.51.128.0 - 46.51.191.255)

46.51.192.0/20 (46.51.192.0 - 46.51.207.255)

46.137.0.0/17 (46.137.0.0 - 46.137.127.255)


Asia Pacific (Singapore):


175.41.128.0/18 (175.41.128.0 - 175.41.191.255)

122.248.192.0/18 (122.248.192.0 - 122.248.255.255)


Asia Pacific (Tokyo):


175.41.192.0/18 (175.41.192.0 - 175.41.255.255)

46.51.224.0/19 (46.51.224.0 - 46.51.255.255) NEW
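
One way to put the list above to work is a lookup that says whether a logged IP falls inside any of these blocks. A minimal Python sketch, assuming the April 2011 CIDRs exactly as given in the email; the names `EC2_RANGES_APRIL_2011` and `ec2_region_for` are made up for illustration, and the ranges will certainly have grown since.

```python
import ipaddress

# CIDRs copied from the Amazon email above (April 2011 snapshot only)
EC2_RANGES_APRIL_2011 = {
    "US East (Northern Virginia)": [
        "216.182.224.0/20", "72.44.32.0/19", "67.202.0.0/18",
        "75.101.128.0/17", "174.129.0.0/16", "204.236.192.0/18",
        "184.73.0.0/16", "184.72.128.0/17", "184.72.64.0/18",
        "50.16.0.0/15", "50.19.0.0/16",
    ],
    "US West (Northern California)": [
        "204.236.128.0/18", "184.72.0.0/18", "50.18.0.0/17",
    ],
    "EU (Ireland)": [
        "79.125.0.0/17", "46.51.128.0/18", "46.51.192.0/20", "46.137.0.0/17",
    ],
    "Asia Pacific (Singapore)": ["175.41.128.0/18", "122.248.192.0/18"],
    "Asia Pacific (Tokyo)": ["175.41.192.0/18", "46.51.224.0/19"],
}

# Parse once so each lookup is only a containment test
NETWORKS = [
    (ipaddress.ip_network(cidr), region)
    for region, cidrs in EC2_RANGES_APRIL_2011.items()
    for cidr in cidrs
]

def ec2_region_for(ip):
    """Return the region label whose listed range contains ip, else None."""
    addr = ipaddress.ip_address(ip)
    for net, region in NETWORKS:
        if addr in net:
            return region
    return None

print(ec2_region_for("67.202.57.30"))    # the Gnomit host from the first post
print(ec2_region_for("204.236.129.39"))
```

A `None` result only means the address is not in this particular snapshot, not that it is unrelated to Amazon.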
7:37 pm on Apr 28, 2011 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:May 14, 2008
posts:3148
votes: 4


Got all of those (he said smugly!) :)

I've also lumped in 46.137.192.0 - 46.137.255.255 (Ireland data services unspecified) with the AWS. I don't need them calling.
7:51 pm on Apr 28, 2011 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:July 26, 2006
posts: 1619
votes: 0


I knew you did dstiles (she grinned wickedly)
3:19 pm on May 1, 2011 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:May 14, 2008
posts:3148
votes: 4


:)

The following hit my site today but although its range belongs to Amazon it is apparently not AWS. I blocked the range anyway.

IP: 207.171.191.60
UA: Jakarta Commons-HttpClient/3.0

Blocked range: 207.171.160.0 - 207.171.191.255 (207.171.160.0/19)
5:16 am on May 2, 2011 (gmt 0)

Moderator This Forum from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:8900
votes: 402


@ dstiles

Jakarta Commons-HttpClient/3.0 is often used to check link validity. Amazon may be verifying links in their A9 index. Just a thought.
4:55 pm on May 2, 2011 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:May 14, 2008
posts:3148
votes: 4


Could be, but I block that UA anyway (jakarta AND HttpClient), plus most "random" link checkers.
11:23 am on May 25, 2011 (gmt 0)

Senior Member

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2001
posts:5460
votes: 3


50.19.134.zz - - [25/May/2011:04:59:21 -0600] "GET / HTTP/1.0" 403 316 "Mysite/MyPage" "Firefox/2.0"
9:19 pm on May 25, 2011 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:May 14, 2008
posts:3148
votes: 4


50.16.0.0 - 50.19.255.255 (50.16/14)
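
For the curious, the /14 arithmetic can be checked against the 50.x blocks in the Amazon email posted earlier in this thread. A rough Python sketch; `unlisted_remainder` is a hypothetical helper, and the "listed" blocks are only the April 2011 snapshot.

```python
import ipaddress

block = ipaddress.ip_network("50.16.0.0/14")
# The 50.x blocks from the April 2011 email quoted earlier in the thread
listed = [ipaddress.ip_network(c)
          for c in ("50.16.0.0/15", "50.18.0.0/17", "50.19.0.0/16")]

def unlisted_remainder(outer, inner_nets):
    """Return the parts of outer that none of inner_nets covers."""
    remaining = [outer]
    for net in inner_nets:
        next_remaining = []
        for piece in remaining:
            if piece.subnet_of(net):
                continue  # piece is fully covered by a listed block
            if net.subnet_of(piece):
                next_remaining.extend(piece.address_exclude(net))
            else:
                next_remaining.append(piece)  # no overlap, keep as is
        remaining = next_remaining
    return sorted(remaining)

print(block[0], "-", block[-1])
print(all(net.subnet_of(block) for net in listed))
print([str(n) for n in unlisted_remainder(block, listed)])
```

So the /14 spans 50.16.0.0 to 50.19.255.255, takes in all three listed blocks, and also sweeps up 50.18.128.0/17, which was not in that email.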

Firefox 2 is so old it's a danger to everyone, even were it genuine.
6:47 pm on June 2, 2011 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Mar 17, 2006
posts: 838
votes: 0


This is a fascinating topic! Thank you Pfui for starting it.

I've been looking at these cloud-based bots for a while now, and the only reason I haven't yet pulled the plug on anything hosted at AWS is that I'm trying to think of any legitimate use of it by someone I'd be interested in (an actual human, Googlebot, msnbot, Slurp).
The only legit use I can think of is that there should be (a bunch of) corporate VPNs with cloud-based access points which may actually have real users behind them.

Am I missing any other legitimate cloud-computing based traffic?

So, do you guys just disallow any IP that belongs to AWS (I am tempted to, to be honest) or do you determine based on IP/behavior? I can think of a situation where Amazon simply recycles IPs after a virtual machine is shut down and so there must be quite a churn of IPs in this huge system.

What's the best way to disallow such a huge swath, anyway : at the firewall level, Apache config or .htaccess ?
10:01 pm on June 2, 2011 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:May 14, 2008
posts:3148
votes: 4


I (and many others) block ALL AWS ranges - in fact I go further and block all Amazon ranges, AWS or not. And any other cloud I can detect (there are quite a few - MS, Google, various ISPs...).

If you have a suitable firewall you could use that. If you have an apache server then use .htaccess (I think - I'm not a linux server person). There are others who are better qualified to answer that and I think it's been answered a few times hereabouts.
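
If the .htaccess route is taken, the deny lines do not have to be typed by hand. A minimal Python sketch that merges overlapping or adjacent blocks and prints them in the `deny from` form used earlier in this thread. The function name `deny_lines` is an assumption for illustration, and the output still has to go inside whatever allow/deny setup your Apache version expects.

```python
import ipaddress

def deny_lines(cidrs):
    """Collapse the given CIDRs and return one 'deny from' line per block."""
    nets = ipaddress.collapse_addresses(ipaddress.ip_network(c) for c in cidrs)
    return ["deny from %s" % net for net in nets]

# Ranges mentioned in this thread; the three 46.137 pieces merge into one /16
ranges = [
    "174.129.0.0/16",
    "122.248.192.0/18",
    "122.248.224.0/19",   # already inside the /18 above
    "46.137.0.0/17",
    "46.137.128.0/18",
    "46.137.192.0/18",
]

for line in deny_lines(ranges):
    print(line)
```

Collapsing first keeps the file short, which matters a little because .htaccess is re-read on every request.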
5:29 am on June 3, 2011 (gmt 0)

Senior Member

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2001
posts:5460
votes: 3


What's the best way to disallow such a huge swath,


It may appear to be a large range of IPs; however, considering the vastness of the overall www, AWS is small fry.
4:00 pm on June 13, 2011 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 5, 2005
posts: 2038
votes: 1


Just passing by to report yet another --

ec2-174-129-52-170.compute-1.amazonaws.com
Mozilla/5.0 (compatible; q1; +http://www.qleeq.com; info@qleeq.com)

robots.txt? Yes
8:43 pm on June 13, 2011 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:May 14, 2008
posts:3148
votes: 4


Good to see you around again, pfui!

174.129/16 - block. :)