Forum Moderators: Ocean10000 & incrediBILL

amazonaws.com plays host to wide variety of bad bots

Most recently seen: Gnomit

     
3:04 am on Jan 18, 2009 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



ec2-67-202-57-30.compute-1.amazonaws.com
Mozilla/5.0 (compatible; X11; U; Linux i686 (x86_64); en-US; +http://gnomit.com/) Gecko/2008092416 Gnomit/1.0"

- robots.txt? NO
- Unbalanced quotation mark in UA (closing only)
- site in UA yields this oh-so-descriptive info:

<html>
<head>
</head>
<body>
</body>
</html>
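
A stray closing quote like the one in the UA above is easy to flag automatically. The following is only a minimal sketch of such a check; the function name `ua_delimiter_problems` is hypothetical, and the rules (odd number of double quotes, unmatched parentheses or square brackets) are my assumption about what counts as "uneven":

```python
def ua_delimiter_problems(ua):
    """Return a list of delimiter problems found in a User-Agent string."""
    problems = []

    # An odd number of double quotes means one was never paired.
    if ua.count('"') % 2:
        problems.append("unbalanced double quote")

    # Track round and square brackets with a simple stack.
    closers = {")": "(", "]": "["}
    stack = []
    for ch in ua:
        if ch in "([":
            stack.append(ch)
        elif ch in closers:
            if stack and stack[-1] == closers[ch]:
                stack.pop()
            else:
                problems.append(f"unmatched {ch}")

    # Anything left on the stack was opened but never closed.
    problems.extend(f"unclosed {ch}" for ch in stack)
    return problems


gnomit = ('Mozilla/5.0 (compatible; X11; U; Linux i686 (x86_64); en-US; '
          '+http://gnomit.com/) Gecko/2008092416 Gnomit/1.0"')
print(ua_delimiter_problems(gnomit))
```

A non-empty result is only a hint that the UA was hand-assembled; it says nothing by itself about whether the visitor is welcome.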

----- ----- ----- ----- -----
FWIW, bona fide amazonaws.com hosts spewed at least 33 bots on two of my sites in recent months. (Does someone get paid per bot or something?) Some bots may be new to some of you, or newly renamed. Here are the actual UA strings, in no particular order:

NetSeer/Nutch-0.9 (NetSeer Crawler; [netseer.com;...] crawler@netseer.com)
robots.txt? YES

Mozilla/5.0 (Windows; U; Windows NT 5.1; ru; rv:1.8.0.6) Gecko/20060728 Firefox/1.5.0.6
[Note ru.]
robots.txt? NO

feedfinder/1.371 Python-urllib/1.16 +http://www.aaronsw.com/2002/feedfinder/
robots.txt? NO

Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9b4pre) Gecko/2008022910 Viewzi/0.1
robots.txt? NO

Twitturly / v0.5
robots.txt? NO

YebolBot (compatible; Mozilla/5.0; MSIE 7.0; Windows NT 6.0; rv:1.8.1.11; mailTo:thunder.chang@gmail.com)
robots.txt? NO

YebolBot (Email: yebolbot@gmail.com; If the web crawling affects your web service, or you don't like to be crawled by us, please email us. We'll stop crawling immediately.)
[Whattaya think robots.txt is for, huh?]
robots.txt? YES ... Four times in 45 minutes

Attributor/Dejan-1.0-dev (Test crawler; [attributor.com;...] info at attributor com)
robots.txt? NO

PRCrawler/Nutch-0.9 (data mining development project)
robots.txt? YES

EnaBot/1.2 (http://www.enaball.com/crawler.html)
robots.txt? YES

Nokia6680/1.0 ((4.04.07) SymbianOS/8.0 Series60/2.6 Profile/MIDP-2.0 Configuration/CLDC-1.1 (botmobi find.mobi/bot.html) )
[Note spaced-out closing parens]
robots.txt? YES

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; T312461) Java/1.5.0_09
robots.txt? NO

TheRarestParser/0.2a (http://therarestwords.com/)
robots.txt? NO

Mozilla/5.0 (compatible; D1GArabicEngine/1.0; crawlmaster@d1g.com)
robots.txt? NO

Clustera Crawler/Nutch-1.0-dev (Clustera Crawler; [crawler.clustera.com;...] cluster@clustera.com)
robots.txt? YES

Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.7) Gecko/20060909 Firefox/1.5.0.7
robots.txt? YES

yacybot (i386 Linux 2.6.16-xenU; java 1.6.0_02; America/en) [yacy.net...]
robots.txt? NO

Mozilla/5.0
robots.txt? NO

Spock Crawler (http://www.spock.com/crawler)
robots.txt? YES

TinEye
robots.txt? NO

Teemer (NetSeer, Inc. is a Los Angeles based Internet startup company.; [netseer.com...] crawler@netseer.com)
robots.txt? YES

nnn/ttt (n)
robots.txt? YES

AideRSS/1.0 (aiderss.com)
robots.txt? NO

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)
robots.txt? NO

----- ----- ----- ----- -----
These two UAs alternated multiple times one afternoon:

Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)
robots.txt? NO

WebClient
robots.txt? YES

----- ----- ----- ----- -----
And finally, way too many offerings from "Paul," who's apparently unable to make up his mind, UA name-wise:

Mozilla/5.0 (compatible; page-store) [email:paul at page-store.com
robots.txt? NO

Mozilla/5.0 (compatible; heritrix/1.12.1 +http://www.page-store.com)
robots.txt? YES

Mozilla/5.0 (compatible; heritrix/1.12.1 +http://www.page-store.com) [email:paul@page-store.com]
robots.txt? YES

Mozilla/5.0 (compatible; zermelo; +http://www.powerset.com) [email:paul@page-store.com,crawl@powerset.com]
robots.txt? NO

zermelo Mozilla/5.0 compatible; heritrix/1.12.1 (+http://www.powerset.com) [email:crawl@powerset.com,email:paul@page-store.com]
robots.txt? YES

zermelo Mozilla/5.0 compatible; heritrix/1.12.1 (+http://www.powerset.com) [email:crawl@powerset.com,email:paul@page-store.com]
robots.txt? YES

Mozilla/5.0 (compatible; zermelo; +http://www.powerset.com) [email:paul@page-store.com,crawl@powerset.com]
robots.txt? NO

-----
Slippery little suckers indeed. Thank goodness I block amazonaws.com no matter what.
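
The reverse-DNS name at the top of this post, ec2-67-202-57-30.compute-1.amazonaws.com, appears to carry the visitor's IPv4 address in its first label. If your logs store only hostnames, a rough sketch like the one below can recover the address. The helper name `ec2_hostname_to_ip` is hypothetical, and the pattern covers only the compute-1 names seen in this thread:

```python
import ipaddress
import re

# Matches names such as ec2-67-202-57-30.compute-1.amazonaws.com
_EC2_NAME = re.compile(
    r"^ec2-(\d{1,3})-(\d{1,3})-(\d{1,3})-(\d{1,3})\.compute-1\.amazonaws\.com$"
)


def ec2_hostname_to_ip(hostname):
    """Return the IPv4 address embedded in an EC2 hostname, or None."""
    match = _EC2_NAME.match(hostname.strip().lower())
    if match is None:
        return None
    try:
        # IPv4Address rejects out-of-range octets such as 999.
        return ipaddress.IPv4Address(".".join(match.groups()))
    except ipaddress.AddressValueError:
        return None


print(ec2_hostname_to_ip("ec2-67-202-57-30.compute-1.amazonaws.com"))
```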

11:54 pm on Feb 18, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



ec2-184-73-16-198.compute-1.amazonaws.com
Nutch/Nutch-1.0-dev+(A+Nutch-based+crawler.;+http://lucene.apache.org/nutch/bot.html;+nutch-agent+AT+lucene.apache.org)

robots.txt? Yes - ignored it.

Went after Homepage and left with a fat 403.
11:54 pm on Feb 25, 2010 (gmt 0)

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member



From the ddanchev security blog re: spamvertised casino games serving up viruses:

"What's particularly interesting about the campaign, is the fact that all of the domains serve identical template, with the (virusname) binary hosted 'in the cloud' thanks to Amazon's Web Services."
5:40 am on Feb 28, 2010 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



ec2-174-129-65-46.compute-1.amazonaws.com
Mozilla/5.0 [Internet Explorer]

robots.txt? NO
4:33 am on Mar 2, 2010 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



ec2-67-202-4-244.compute-1.amazonaws.com
Java/1.6.0_16

robots.txt? NO

Made a beeline for two dynamically generated files. HEAD requests.
5:10 am on Mar 14, 2010 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



ec2-204-236-232-79.compute-1.amazonaws.com
Mozilla/5.0 (compatible; Feedtrace-bot/0.2; bot@feedtrace.com)

03/13 17:58:05 /robots.txt

ec2-184-73-40-108.compute-1.amazonaws.com
Mozilla/5.0 (compatible; Feedtrace-bot/0.2; bot@feedtrace.com)

03/13 17:58:06 /robots.txt
3:18 am on Mar 20, 2010 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



ec2-75-101-172-69.compute-1.amazonaws.com
Ruby EventMachine

robots.txt? NO
2:36 pm on Mar 22, 2010 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



ec2-67-202-36-161.compute-1.amazonaws.com
JS-Kit URL Resolver, http://js-kit.com/

robots.txt? NO
7:45 pm on Apr 8, 2010 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



ec2-174-129-248-30.compute-1.amazonaws.com
Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.4) Gecko/2008102920 http://ow.ly web crawler (.NET CLR 3.5.30729)

robots.txt? NO
5:45 pm on Apr 10, 2010 (gmt 0)

WebmasterWorld Senior Member keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month




ec2-75-101-149-212.compute-1.amazonaws.com
TrueKnowledgeBot (http://www.trueknowledge.com/tkbot/; tkbot -AT- trueknowledge _dot_ com)

robots.txt: yes
4:39 am on Apr 20, 2010 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



Twitter-swarmer:

ec2-204-236-201-131.compute-1.amazonaws.com
justsignal/1.0 (+http://justsignal.com)

robots.txt? NO
Fake ref/Log spam? YES, the UA's Host
6:54 am on Apr 20, 2010 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



Twitter-swarming 'web miner':

ec2-174-129-76-46.compute-1.amazonaws.com
Mozilla/5.0 (compatible; ptd-crawler; +http://bixolabs.com/crawler/ptd/; crawler@bixolabs.com)

robots.txt? Yes
11:01 pm on Apr 24, 2010 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



ec2-204-236-236-136.compute-1.amazonaws.com
Mozilla/5.0 (compatible; Firefox Addon; Windows XP 5.1)

robots.txt? NO

Four hits to / in five seconds (403s ignored).
9:33 pm on Apr 30, 2010 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



ec2-72-44-62-4.compute-1.amazonaws.com
Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3

robots.txt? Yes
5:31 pm on May 11, 2010 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



ec2-174-129-75-189.compute-1.amazonaws.com
research-scan-bot/Nutch-1.0

robots.txt? Yes
7:21 pm on May 11, 2010 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



Again from AWS, last seen as "Acquia Crawler" in Aug., 2009 here [webmasterworld.com]

Now:

ec2-184-73-19-148.compute-1.amazonaws.com
acquia-crawler

robots.txt? NO
3:08 pm on May 16, 2010 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



ec2-184-73-120-39.compute-1.amazonaws.com
Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Win64; x64; Trident/4.0)

robots.txt? NO
10:51 pm on May 31, 2010 (gmt 0)

5+ Year Member



Again from AWS, last seen as "Acquia Crawler" in Aug., 2009 here [webmasterworld.com]

Now:

ec2-184-73-19-148.compute-1.amazonaws.com
acquia-crawler

robots.txt? NO


Hey,
I'm the developer of the current acquia crawler. It actually has no shared code with the "old" one.
I'm sorry that it didn't check the robots.txt first. It SHOULD check the robots.txt first (and in general only look at the front page ("/") of the website).
It would be nice if you could PM me the URL of your site so I can see what might have caused this.
5:17 pm on Jun 21, 2010 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



Real Yahoo doubtful, double-cloak, or double-cross? You be the judge...

ec2-184-73-3-35.compute-1.amazonaws.com
Mozilla/5.0 [en] (X11; U; Linux 2.2.15 i686 +http://www.yahoo.com/index.html)

robots.txt? Yes, but promptly ignored.

06/21 02:30:13 /robots.txt
06/21 02:30:17 /
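
For comparison, here is roughly what a well-behaved crawler would do between those two log lines: parse the robots.txt it just fetched and check the next URL against it. This is a sketch using Python's standard `urllib.robotparser`. The rules and the example.com URL are hypothetical stand-ins, since the actual robots.txt isn't shown here:

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules: shut out every crawler from the whole site.
rules = "User-agent: *\nDisallow: /\n"
ua = "Mozilla/5.0 [en] (X11; U; Linux 2.2.15 i686 +http://www.yahoo.com/index.html)"

# A compliant bot would stop here instead of requesting "/".
print(allowed(rules, ua, "http://example.com/"))
```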
9:31 am on Jun 22, 2010 (gmt 0)

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month



Pfui... you are our resident aws expert. Have you considered updating your original post with new info? 168 messages to wade through... well, yeah, I suppose that's why we're here... the interaction. Yet, the issue is getting out of hand! As Boris Karloff used to say: BWAHAHAHAHA!
3:32 am on Jul 10, 2010 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



tangor, long story short:

Block/deny/rewrite .amazonaws.com by name

: )
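
For anyone who'd rather do that in application code than in server config, here is a minimal sketch of blocking by name. It assumes you have the visitor's IP and can do a reverse-DNS lookup. The function names are made up for illustration, and letting hosts with no PTR record through is a policy choice, not a recommendation:

```python
import socket

BLOCKED_DOMAIN = "amazonaws.com"


def is_blocked_hostname(hostname):
    """True if hostname is amazonaws.com itself or any subdomain of it."""
    name = hostname.strip().rstrip(".").lower()
    return name == BLOCKED_DOMAIN or name.endswith("." + BLOCKED_DOMAIN)


def should_block(ip, resolve=socket.gethostbyaddr):
    """Reverse-resolve ip and report whether its name is blocked.

    `resolve` is injectable so the check can be exercised without DNS.
    """
    try:
        hostname = resolve(ip)[0]
    except OSError:
        # No reverse DNS available; this sketch lets such hosts through.
        return False
    return is_blocked_hostname(hostname)
```

Matching on ".amazonaws.com" with the leading dot keeps an unrelated name such as notamazonaws.com from being caught by accident.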

And now, from the Department of Tweet ReReReRedundancy Department:

ec2-184-73-67-107.compute-1.amazonaws.com
UA: NONE
robots.txt? NO

Req: HEAD
Fake Ref: @hourlypress

(Tweets upon retweets upon reretweets remind me more and more of e-mailed chain letter plagues.)
8:49 am on Jul 10, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I must be missing something because I still don't understand why amazonaws.com is visiting so many sites and so often. Can someone please explain it?

Amazonaws.com is in almost all of my sites' referral logs with lots of visits day after day, week after week, and month after month.

Why in the world does Amazon want to visit my sites in the first place? For what purpose? How do they know about my URLs, especially the new sites? Is it somehow connected to the amazon.com affiliate program (even though many of the sites they visit do not use the affiliate program and never did)?

Sometimes they visit a brand new site I just put online even before anyone else does or the site gets indexed anywhere.

How would Amazon know that my url was just put online? Sometimes they are my #1 traffic source! How and why is this happening? Please excuse my lack of knowledge on this subject.
6:21 pm on Jul 10, 2010 (gmt 0)

WebmasterWorld Senior Member keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



@ trader

This thread is speaking of amazonaws.com, a virtual computing service offered by Amazon. The actual owners of these agents that come to your web sites are customers of Amazon.

I suggest reading this entire thread as well as doing some searches to gain a better understanding of what happens at a cloud service. There are others besides amazonaws.com.
7:45 pm on Jul 10, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks. I did read or scan most of the thread (though it is often too technical for me), but I still do not understand why they are visiting my sites so much and what they have to gain from all the hits. Why is it worth the effort?

I'm also confused about how they even know about my websites in the first place (an issue I do not see answered), especially my brand new site URLs that were just developed and are not indexed anywhere yet. Sometimes I see amazonaws.com in my log files as one of the first visitors to a new site I put online just the day before!

Sorry I am not too educated on this or about cloud service which I never heard of before you mentioned it. P.S. I am not a programmer or a geek (obviously).
9:04 pm on Jul 10, 2010 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



In a nutshell: "amazonaws.com" is an alias used by a multitude of anonymous AWS customers.

Amazon, like others, offers 'cloud' computing services where companies/people rent virtual server space. This thread is testament to the stunning number of bad robots/crawlers/crap apps hiding in the amazonaws.com cloud.

And those "amazonaws.com" hits you're seeing? Someone's using your resources for secret reasons.
10:07 pm on Jul 10, 2010 (gmt 0)

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member



To add to the list of bots from aws:

NetcraftSurvey, previously running from UK BT block 194.72.238.0/24, is now on AWS with a new bot UA. At least, I think it's the same service: the new one includes netcraft.com in the UA and the old one had a referer of netcraft.com, so they may serve different functions.

New UA: Mozilla/5.0 (compatible; NetcraftSurveyAgent/1.0; +info@netcraft.com)

It's possible the old one is still running. First time I saw this was today.
3:29 pm on Jul 11, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks Pfui, but what in the world could some of the "secret reasons" possibly be? I am guessing that sending spam may be one of them?

We are aware of a lot of spam being sent with disguised headers, claiming to be from me and from an email address configured on the server (one with a unique, far-from-obvious name that could not be guessed).

However, those email addresses are not shown or listed anywhere on the web. They were only created when the site was set up on the server, because the setup process required an email address to be entered (though we never used it or intended to use it).
4:39 pm on Jul 11, 2010 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



Probably not quite so nefarious. I reckon secret reasons are more likely trade secret-related -- companies/people data-mining to build or improve their hoped-for better mousetraps.

I get why they hide. But when they use/abuse my stuff, I want WHOIS, if not Who, What, When, Where, Why. Because, heck, if their aims aren't nefarious, why not?

(A bit OT: Spam-wise, faked addresses can be from anybody anywhere. But with yours that private to begin with... I dunno. Perhaps your server was more vulnerable at some point than it is now?)
8:49 pm on Jul 11, 2010 (gmt 0)

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member



My experience of AWS is that it hosts a myriad of site scrapers or trivial bots (eg dozens of twitter/facebook me-too bots) that have no obvious use and no definite IP to block if we don't like them. My logs are full of the things, all blocked. Many of them pretend to be browsers, although certain characteristics belie that.

If a bot can't identify itself properly and supply a fixed address it's toast! I don't care who or what it is.
1:04 am on Jul 13, 2010 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



A new addition to the post-Tweet swarm:

ec2-174-129-180-142.compute-1.amazonaws.com
my6sense/1.0

robots.txt? NO
1:36 am on Jul 13, 2010 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



What a waste of log space. All post-Tweet HEAD hits to the exact same file. (403s obviously ignored.) No robots.txt 'natch -- or ever, re most UAs from amazonaws.

ec2-204-236-206-79.compute-1.amazonaws.com
PostRank/2.0 (postrank.com)
07/12 17:45:24
07/12 17:45:34
07/12 17:45:35
07/12 17:45:36

ec2-204-236-254-109.compute-1.amazonaws.com
PostRank/2.0 (postrank.com)
07/12 17:48:01
07/12 17:48:07
07/12 17:48:08
07/12 17:48:09
07/12 17:54:41
07/12 17:54:58
07/12 17:55:16
07/12 17:55:39
07/12 17:56:28
07/12 17:57:07
07/12 17:57:48
07/12 17:58:35
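
To put a number on bursts like these, a small sliding-window count over the log timestamps does the job. This is only a sketch: the function names are hypothetical, the "MM/DD HH:MM:SS" format is taken from the lines above, and the year 2010 is assumed from the posting date because the log lines don't carry one:

```python
from datetime import datetime, timedelta

ASSUMED_YEAR = 2010  # assumption: log stamps omit the year


def parse_stamp(stamp):
    """Parse a 'MM/DD HH:MM:SS' log stamp into a datetime."""
    return datetime.strptime(f"{ASSUMED_YEAR} {stamp}", "%Y %m/%d %H:%M:%S")


def max_hits_in_window(stamps, window_seconds):
    """Largest number of hits that fall strictly within window_seconds of each other."""
    times = sorted(parse_stamp(s) for s in stamps)
    window = timedelta(seconds=window_seconds)
    best = 0
    start = 0
    for end, current in enumerate(times):
        # Slide the left edge until the span is shorter than the window.
        while current - times[start] >= window:
            start += 1
        best = max(best, end - start + 1)
    return best


first_host = ["07/12 17:45:24", "07/12 17:45:34", "07/12 17:45:35", "07/12 17:45:36"]
print(max_hits_in_window(first_host, 5))
```

A per-host threshold on that count is one way to decide when a swarm has gone from annoying to block-worthy.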
This 278-message thread spans 10 pages.
 
