
Home / Forums Index / Search Engines / Search Engine Spider and User Agent Identification
Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

This 278 message thread spans 10 pages; this is page 6.
amazonaws.com plays host to a wide variety of bad bots
Most recently seen: Gnomit
Pfui




msg:3828720
 3:04 am on Jan 18, 2009 (gmt 0)

ec2-67-202-57-30.compute-1.amazonaws.com
Mozilla/5.0 (compatible; X11; U; Linux i686 (x86_64); en-US; +http://gnomit.com/) Gecko/2008092416 Gnomit/1.0"

- robots.txt? NO
- Uneven quotation marks in UA (only closing)
- site in UA yields this oh-so-descriptive info:

<html>
<head>
</head>
<body>
</body>
</html>

----- ----- ----- ----- -----
FWIW, bona fide amazonaws.com hosts spewed at least 33 bots on two of my sites in recent months. (Does someone get paid per bot or something?) Some bots may be new to some of you, or newly renamed. Here are the actual UA strings, in no particular order:

NetSeer/Nutch-0.9 (NetSeer Crawler; [netseer.com;...] crawler@netseer.com)
robots.txt? YES

Mozilla/5.0 (Windows; U; Windows NT 5.1; ru; rv:1.8.0.6) Gecko/20060728 Firefox/1.5.0.6
[Note ru.]
robots.txt? NO

feedfinder/1.371 Python-urllib/1.16 +http://www.aaronsw.com/2002/feedfinder/
robots.txt? NO

Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9b4pre) Gecko/2008022910 Viewzi/0.1
robots.txt? NO

Twitturly / v0.5
robots.txt? NO

YebolBot (compatible; Mozilla/5.0; MSIE 7.0; Windows NT 6.0; rv:1.8.1.11; mailTo:thunder.chang@gmail.com)
robots.txt? NO

YebolBot (Email: yebolbot@gmail.com; If the web crawling affects your web service, or you don't like to be crawled by us, please email us. We'll stop crawling immediately.)
[Whattaya think robots.txt is for, huh?]
robots.txt? YES ... Four times in 45 minutes

Attributor/Dejan-1.0-dev (Test crawler; [attributor.com;...] info at attributor com)
robots.txt? NO

PRCrawler/Nutch-0.9 (data mining development project)
robots.txt? YES

EnaBot/1.2 (http://www.enaball.com/crawler.html)
robots.txt? YES

Nokia6680/1.0 ((4.04.07) SymbianOS/8.0 Series60/2.6 Profile/MIDP-2.0 Configuration/CLDC-1.1 (botmobi find.mobi/bot.html) )
[Note spaced-out closing parens]
robots.txt? YES

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; T312461) Java/1.5.0_09
robots.txt? NO

TheRarestParser/0.2a (http://therarestwords.com/)
robots.txt? NO

Mozilla/5.0 (compatible; D1GArabicEngine/1.0; crawlmaster@d1g.com)
robots.txt? NO

Clustera Crawler/Nutch-1.0-dev (Clustera Crawler; [crawler.clustera.com;...] cluster@clustera.com)
robots.txt? YES

Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.7) Gecko/20060909 Firefox/1.5.0.7
robots.txt? YES

yacybot (i386 Linux 2.6.16-xenU; java 1.6.0_02; America/en) [yacy.net...]
robots.txt? NO

Mozilla/5.0
robots.txt? NO

Spock Crawler (http://www.spock.com/crawler)
robots.txt? YES

TinEye
robots.txt? NO

Teemer (NetSeer, Inc. is a Los Angeles based Internet startup company.; [netseer.com...] crawler@netseer.com)
robots.txt? YES

nnn/ttt (n)
robots.txt? YES

AideRSS/1.0 (aiderss.com)
robots.txt? NO

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)
robots.txt? NO

----- ----- ----- ----- -----
These two UAs alternated multiple times one afternoon:

Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)
robots.txt? NO

WebClient
robots.txt? YES

----- ----- ----- ----- -----
And finally, way too many offerings from "Paul," who's apparently unable to make up his mind, UA name-wise:

Mozilla/5.0 (compatible; page-store) [email:paul at page-store.com
robots.txt? NO

Mozilla/5.0 (compatible; heritrix/1.12.1 +http://www.page-store.com)
robots.txt? YES

Mozilla/5.0 (compatible; heritrix/1.12.1 +http://www.page-store.com) [email:paul@page-store.com]
robots.txt? YES

Mozilla/5.0 (compatible; zermelo; +http://www.powerset.com) [email:paul@page-store.com,crawl@powerset.com]
robots.txt? NO

zermelo Mozilla/5.0 compatible; heritrix/1.12.1 (+http://www.powerset.com) [email:crawl@powerset.com,email:paul@page-store.com]
robots.txt? YES

zermelo Mozilla/5.0 compatible; heritrix/1.12.1 (+http://www.powerset.com) [email:crawl@powerset.com,email:paul@page-store.com]
robots.txt? YES

Mozilla/5.0 (compatible; zermelo; +http://www.powerset.com) [email:paul@page-store.com,crawl@powerset.com]
robots.txt? NO

-----
Slippery little suckers indeed. Thank goodness I block amazonaws.com no matter what.
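
For anyone sorting their own logs: the EC2 hostnames quoted in this thread carry the instance's IP address in the name itself (ec2-67-202-57-30 is 67.202.57.30). Below is a minimal Python sketch, assuming your logs already contain resolved hostnames. The function names `is_amazonaws` and `ec2_ip` are made up for illustration, and the pattern only covers the `ec2-a-b-c-d` form seen in this thread.

```python
import re
from typing import Optional

# Matches the EC2 reverse-DNS form quoted in this thread, e.g.
# ec2-67-202-57-30.compute-1.amazonaws.com (assumption: other AWS
# naming schemes exist and are not covered here).
EC2_HOST = re.compile(
    r"^ec2-(\d{1,3})-(\d{1,3})-(\d{1,3})-(\d{1,3})\.(?:[a-z0-9-]+\.)*amazonaws\.com$"
)


def _normalize(host: str) -> str:
    """Lower-case the name and drop whitespace and any trailing dot."""
    return host.strip().lower().rstrip(".")


def is_amazonaws(host: str) -> bool:
    """True if the hostname is amazonaws.com or any subdomain of it."""
    host = _normalize(host)
    return host == "amazonaws.com" or host.endswith(".amazonaws.com")


def ec2_ip(host: str) -> Optional[str]:
    """Return the dotted IP embedded in an ec2-a-b-c-d hostname, else None."""
    match = EC2_HOST.match(_normalize(host))
    if match is None:
        return None
    octets = match.groups()
    if any(int(octet) > 255 for octet in octets):
        return None
    return ".".join(octets)
```

Matching on the ".amazonaws.com" suffix with its leading dot keeps look-alike names such as "notamazonaws.com" out of the net.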

 

blend27




msg:4083220
 11:54 pm on Feb 18, 2010 (gmt 0)

ec2-184-73-16-198.compute-1.amazonaws.com
Nutch/Nutch-1.0-dev+(A+Nutch-based+crawler.;+http://lucene.apache.org/nutch/bot.html;+nutch-agent+AT+lucene.apache.org)

robots.txt? Yes - ignored it.

Went after Homepage and left with a fat 403.

dstiles




msg:4087464
 11:54 pm on Feb 25, 2010 (gmt 0)

From the ddanchev security blog re: spamvertised casino games serving up viruses:

"What's particularly interesting about the campaign, is the fact that all of the domains serve identical template, with the (virusname) binary hosted 'in the cloud' thanks to Amazon's Web Services."

Pfui




msg:4088770
 5:40 am on Feb 28, 2010 (gmt 0)

ec2-174-129-65-46.compute-1.amazonaws.com
Mozilla/5.0 [Internet Explorer]

robots.txt? NO

Pfui




msg:4089719
 4:33 am on Mar 2, 2010 (gmt 0)

ec2-67-202-4-244.compute-1.amazonaws.com
Java/1.6.0_16

robots.txt? NO

Made a beeline for two dynamically generated files. HEAD requests.

Pfui




msg:4097259
 5:10 am on Mar 14, 2010 (gmt 0)

ec2-204-236-232-79.compute-1.amazonaws.com
Mozilla/5.0 (compatible; Feedtrace-bot/0.2; bot@feedtrace.com)

03/13 17:58:05 /robots.txt

ec2-184-73-40-108.compute-1.amazonaws.com
Mozilla/5.0 (compatible; Feedtrace-bot/0.2; bot@feedtrace.com)

03/13 17:58:06 /robots.txt

Pfui




msg:4101399
 3:18 am on Mar 20, 2010 (gmt 0)

ec2-75-101-172-69.compute-1.amazonaws.com
Ruby EventMachine

robots.txt? NO

Pfui




msg:4102332
 2:36 pm on Mar 22, 2010 (gmt 0)

ec2-67-202-36-161.compute-1.amazonaws.com
JS-Kit URL Resolver, http://js-kit.com/

robots.txt? NO

Pfui




msg:4112562
 7:45 pm on Apr 8, 2010 (gmt 0)

ec2-174-129-248-30.compute-1.amazonaws.com
Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.4) Gecko/2008102920 http://ow.ly web crawler (.NET CLR 3.5.30729)

robots.txt? NO

keyplyr




msg:4113612
 5:45 pm on Apr 10, 2010 (gmt 0)


ec2-75-101-149-212.compute-1.amazonaws.com
TrueKnowledgeBot (http://www.trueknowledge.com/tkbot/; tkbot -AT- trueknowledge _dot_ com)

robots.txt: yes

Pfui




msg:4118561
 4:39 am on Apr 20, 2010 (gmt 0)

Twitter-swarmer:

ec2-204-236-201-131.compute-1.amazonaws.com
justsignal/1.0 (+http://justsignal.com)

robots.txt? NO
Fake ref/Log spam? YES, the UA's Host

Pfui




msg:4118597
 6:54 am on Apr 20, 2010 (gmt 0)

Twitter-swarming 'web miner':

ec2-174-129-76-46.compute-1.amazonaws.com
Mozilla/5.0 (compatible; ptd-crawler; +http://bixolabs.com/crawler/ptd/; crawler@bixolabs.com)

robots.txt? Yes

Pfui




msg:4121483
 11:01 pm on Apr 24, 2010 (gmt 0)

ec2-204-236-236-136.compute-1.amazonaws.com
Mozilla/5.0 (compatible; Firefox Addon; Windows XP 5.1)

robots.txt? NO

Four hits to / in five seconds (403s ignored).

Pfui




msg:4125206
 9:33 pm on Apr 30, 2010 (gmt 0)

ec2-72-44-62-4.compute-1.amazonaws.com
Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3

robots.txt? Yes

Pfui




msg:4130938
 5:31 pm on May 11, 2010 (gmt 0)

ec2-174-129-75-189.compute-1.amazonaws.com
research-scan-bot/Nutch-1.0

robots.txt? Yes

Pfui




msg:4131016
 7:21 pm on May 11, 2010 (gmt 0)

Again from AWS, last seen as "Acquia Crawler" in Aug., 2009 here [webmasterworld.com]

Now:

ec2-184-73-19-148.compute-1.amazonaws.com
acquia-crawler

robots.txt? NO

Pfui




msg:4133566
 3:08 pm on May 16, 2010 (gmt 0)

ec2-184-73-120-39.compute-1.amazonaws.com
Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Win64; x64; Trident/4.0)

robots.txt? NO

rb2k




msg:4144544
 10:51 pm on May 31, 2010 (gmt 0)

Again from AWS, last seen as "Acquia Crawler" in Aug., 2009 here [webmasterworld.com]

Now:

ec2-184-73-19-148.compute-1.amazonaws.com
acquia-crawler

robots.txt? NO


Hey,
I'm the developer of the current acquia crawler. It actually has no shared code with the "old" one.
I'm sorry that it didn't check the robots.txt first. It SHOULD check the robots.txt first (and in general only look at the front page ("/") of the website).
It would be nice if you could PM me the URL of your site so I can see what might have caused this.

Pfui




msg:4156141
 5:17 pm on Jun 21, 2010 (gmt 0)

Real Yahoo doubtful, double-cloak, or double-cross? You be the judge...

ec2-184-73-3-35.compute-1.amazonaws.com
Mozilla/5.0 [en] (X11; U; Linux 2.2.15 i686 +http://www.yahoo.com/index.html)

robots.txt? Yes, but promptly ignored.

06/21 02:30:13 /robots.txt
06/21 02:30:17 /
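
To decide whether a request made right after a robots.txt fetch actually ignored the rules, you can replay your own robots.txt through Python's standard `urllib.robotparser`. This is only a sketch: the `SAMPLE_ROBOTS` content and the helper name `allowed` are assumptions for illustration, not the rules of the site in the log above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that shuts out every crawler.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network fetch is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

If `allowed(...)` is False for "/" and the log still shows a hit on "/" seconds after the robots.txt fetch, the bot read the file and carried on regardless.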

tangor




msg:4156697
 9:31 am on Jun 22, 2010 (gmt 0)

Pfui... you are our resident aws expert. Have you considered updating your original post with new info? 168 messages to wade through... well, yeah, I suppose that's why we're here... the interaction. Yet, the issue is getting out of hand! As Boris Karloff used to say: BWAHAHAHAHA!

Pfui




msg:4167566
 3:32 am on Jul 10, 2010 (gmt 0)

tangor, long story short:

Block/deny/rewrite .amazonaws.com by name

: )
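
Blocking "by name" means a reverse DNS lookup on the visitor's IP. A rough Python sketch of that lookup follows, with a forward lookup added to confirm the name really maps back to the same IP. The function names and the injectable `rev`/`fwd` parameters are my own assumptions for illustration. For plain blocking you may decide the reverse name alone is enough; the forward check matters more when you want to be sure a host is bona fide.

```python
import socket
from typing import Callable, List


def reverse_lookup(ip: str) -> str:
    """PTR lookup via the system resolver; returns '' when no name exists."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return ""


def forward_lookup(host: str) -> List[str]:
    """Address lookup via the system resolver; returns [] on failure."""
    try:
        return socket.gethostbyname_ex(host)[2]
    except OSError:
        return []


def is_confirmed_host(
    ip: str,
    suffix: str = ".amazonaws.com",
    rev: Callable[[str], str] = reverse_lookup,
    fwd: Callable[[str], List[str]] = forward_lookup,
) -> bool:
    """True if ip reverse-resolves under suffix and that name maps back to ip."""
    host = rev(ip).lower().rstrip(".")
    if not host.endswith(suffix):
        return False
    # Forward-confirm: the claimed name must resolve to the original IP.
    return ip in fwd(host)
```

Passing the resolvers in as parameters makes it easy to try the logic against canned answers before pointing it at live DNS.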

And now, from the Department of Tweet ReReReRedundancy Department:

ec2-184-73-67-107.compute-1.amazonaws.com
UA: NONE
robots.txt? NO

Req: HEAD
Fake Ref: @hourlypress

(Tweets upon retweets upon reretweets remind me more and more of e-mailed chain letter plagues.)

trader




msg:4167616
 8:49 am on Jul 10, 2010 (gmt 0)

I must be missing something because I still don't understand why amazonaws.com is visiting so many sites and so often. Can someone please explain it?

Amazonaws.com is in almost all of my sites' referral logs with lots of visits day after day, week after week, and month after month.

Why in the world does Amazon want to visit my sites in the first place? For what purpose? How do they know about my URLs, especially the new sites? Is it somehow connected to the amazon.com affiliate program (however, the many sites they visit are not using the affiliate program and never did)?

Sometimes they visit a brand new site I just put online even before anyone else does or the site gets indexed anywhere.

How would Amazon know that my URL was just put online? Sometimes they are my #1 traffic source! How and why is this happening? Please excuse my lack of knowledge on this subject.

keyplyr




msg:4167780
 6:21 pm on Jul 10, 2010 (gmt 0)

@ trader

This thread is speaking of amazonaws.com, a virtual computing service offered by Amazon. The actual owners of these agents that come to your web sites are customers of Amazon.

I suggest reading this entire thread as well as doing some searches to gain a better understanding of what happens at a cloud service. There are others besides amazonaws.com.

trader




msg:4167814
 7:45 pm on Jul 10, 2010 (gmt 0)

Thanks. I did read or scan most of the thread (though it was often too technical for me) but still do not understand why they are visiting my sites so much and what they have to gain by all the hits. Why is it worth the effort?

I'm also confused about how they even know about my websites in the first place (an issue I do not see answered), especially my brand new site URLs that were just developed and are not indexed anywhere yet. Sometimes I see amazonaws.com in my log files as one of the first visitors to a new site put online just the day before!

Sorry, I am not too educated on this or on cloud services, which I had never heard of before you mentioned them. P.S. I am not a programmer or a geek (obviously).

Pfui




msg:4167838
 9:04 pm on Jul 10, 2010 (gmt 0)

In a nutshell: "amazonaws.com" is an alias used by a multitude of anonymous AWS customers.

Amazon, like others, offers 'cloud' computing services where companies/people rent virtual server space. This thread is testament to the stunning number of bad robots/crawlers/crap apps hiding in the amazonaws.com cloud.

And those "amazonaws.com" hits you're seeing? Someone's using your resources for secret reasons.

dstiles




msg:4167846
 10:07 pm on Jul 10, 2010 (gmt 0)

To add to the list of bots from aws:

NetcraftSurvey, previously running from UK BT block 194.72.238.0/24, is now on aws with a new bot UA. At least, I think it's the same service: the new one includes netcraft.com in the UA and the old one had a referer of netcraft.com, so they may serve different functions.

New UA: Mozilla/5.0 (compatible; NetcraftSurveyAgent/1.0; +info@netcraft.com)

It's possible the old one is still running. First time I saw this was today.

trader




msg:4168092
 3:29 pm on Jul 11, 2010 (gmt 0)

Thanks Pfui, but what in the world could some of the "secret reasons" possibly be? I am guessing that sending spam is one of them?

We are aware of lots of spam being sent with disguised headers, claiming to be from me and from an email address configured on the server (a unique, far-from-obvious name that could not be guessed).

However, those email addresses are not shown or listed anywhere on the web. They were only created when the site was set up on the server, because the setup process wanted an email address to be entered (though we never used it or intended to use it).

Pfui




msg:4168102
 4:39 pm on Jul 11, 2010 (gmt 0)

Probably not quite so nefarious. I reckon secret reasons are more likely trade secret-related -- companies/people data-mining to build or improve their hoped-for better mousetraps.

I get why they hide. But when they use/abuse my stuff, I want WHOIS, if not Who, What, When, Where, Why. Because, heck, if their aims aren't nefarious, why not?

(A bit OT: Spam-wise, faked addresses can be from anybody anywhere. But with yours that private to begin with... I dunno. Perhaps your server was more vulnerable at some point than it is now?)

dstiles




msg:4168173
 8:49 pm on Jul 11, 2010 (gmt 0)

My experience of AWS is that it hosts a myriad of site scrapers or trivial bots (e.g. dozens of twitter/facebook me-too bots) that have no obvious use and no definite IP to block if we don't like them. My logs are full of the things, all blocked. Many of them pretend to be browsers, although certain characteristics belie that.

If a bot can't identify itself properly and supply a fixed address it's toast! I don't care who or what it is.
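
One crude way to flag the browser-pretenders is sketched below in Python. The heuristic is my own assumption, not the specific characteristics referred to above: a UA that starts with "Mozilla/" yet names no bot, URL or contact address, arriving from a cloud hostname, is treated as suspect. The names `BOT_HINT`, `looks_like_browser` and `suspicious` are hypothetical.

```python
import re

# Tokens that suggest a UA is admitting to being a robot (assumed list).
BOT_HINT = re.compile(r"bot|crawl|spider|nutch|heritrix|https?://|@", re.IGNORECASE)


def looks_like_browser(ua: str) -> bool:
    """True for a Mozilla-style UA with no bot name, URL or email in it."""
    return ua.startswith("Mozilla/") and not BOT_HINT.search(ua)


def suspicious(host: str, ua: str) -> bool:
    """True if a cloud-hosted visitor sends a blank or browser-like UA."""
    on_cloud = host.lower().rstrip(".").endswith(".amazonaws.com")
    return on_cloud and (ua.strip() == "" or looks_like_browser(ua))
```

Real browsers do not normally browse from rented server space, which is the only reason this pairing of host and UA is worth a second look. Expect false positives from proxies and VPNs hosted in the same cloud.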

Pfui




msg:4168872
 1:04 am on Jul 13, 2010 (gmt 0)

A new addition to the post-Tweet swarm:

ec2-174-129-180-142.compute-1.amazonaws.com
my6sense/1.0

robots.txt? NO

Pfui




msg:4168892
 1:36 am on Jul 13, 2010 (gmt 0)

What a waste of log space. All post-Tweet HEAD hits to the exact same file. (403s obviously ignored.) No robots.txt 'natch -- or ever, re most UAs from amazonaws.

ec2-204-236-206-79.compute-1.amazonaws.com
PostRank/2.0 (postrank.com)
07/12 17:45:24
07/12 17:45:34
07/12 17:45:35
07/12 17:45:36

ec2-204-236-254-109.compute-1.amazonaws.com
PostRank/2.0 (postrank.com)
07/12 17:48:01
07/12 17:48:07
07/12 17:48:08
07/12 17:48:09
07/12 17:54:41
07/12 17:54:58
07/12 17:55:16
07/12 17:55:39
07/12 17:56:28
07/12 17:57:07
07/12 17:57:48
07/12 17:58:35
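
Bursts like the ones above can be counted mechanically. Here is a small Python sketch, assuming timestamps in the same "MM/DD HH:MM:SS" form and a fixed year (2010, the year of this post) since the log lines carry none. The function name `max_hits_in_window` and the 60-second default are assumptions for illustration.

```python
from datetime import datetime, timedelta
from typing import List

LOG_YEAR = "2010"  # assumption: the log lines above do not include a year


def max_hits_in_window(stamps: List[str], window_seconds: int = 60) -> int:
    """Return the largest number of hits falling inside any sliding window."""
    times = sorted(
        datetime.strptime(f"{LOG_YEAR}/{stamp}", "%Y/%m/%d %H:%M:%S")
        for stamp in stamps
    )
    window = timedelta(seconds=window_seconds)
    best = 0
    start = 0
    for end, current in enumerate(times):
        # Drop hits from the front until the window fits again.
        while current - times[start] > window:
            start += 1
        best = max(best, end - start + 1)
    return best
```

Fed the four timestamps from the first host above, this returns 4 for a 60-second window. A threshold on that number is one way to decide when repeated hits have earned a block.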

incrediBILL




msg:4171685
 5:20 pm on Jul 16, 2010 (gmt 0)

I get why they hide.


I don't think most are hiding.

AWS cloud services just happen to be cheap and fast, something a startup company with limited resources would find appealing.

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved