Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

This 278 message thread spans 10 pages (this is page 6).
amazonaws.com plays host to a wide variety of bad bots
Most recently seen: Gnomit
Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 3828718 posted 3:04 am on Jan 18, 2009 (gmt 0)

ec2-67-202-57-30.compute-1.amazonaws.com
Mozilla/5.0 (compatible; X11; U; Linux i686 (x86_64); en-US; +http://gnomit.com/) Gecko/2008092416 Gnomit/1.0"

- robots.txt? NO
- Uneven apostrophes in UA (only closing)
- site in UA yields this oh-so-descriptive info:

<html>
<head>
</head>
<body>
</body>
</html>

----- ----- ----- ----- -----
FWIW, bona fide amazonaws.com hosts spewed at least 33 bots on two of my sites in recent months. (Does someone get paid per bot or something?) Some bots may be new to some of you; or newly renamed. Here are the actual UA strings; in no particular order:

NetSeer/Nutch-0.9 (NetSeer Crawler; [netseer.com;...] crawler@netseer.com)
robots.txt? YES

Mozilla/5.0 (Windows; U; Windows NT 5.1; ru; rv:1.8.0.6) Gecko/20060728 Firefox/1.5.0.6
[Note ru.]
robots.txt? NO

feedfinder/1.371 Python-urllib/1.16 +http://www.aaronsw.com/2002/feedfinder/
robots.txt? NO

Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9b4pre) Gecko/2008022910 Viewzi/0.1
robots.txt? NO

Twitturly / v0.5
robots.txt? NO

YebolBot (compatible; Mozilla/5.0; MSIE 7.0; Windows NT 6.0; rv:1.8.1.11; mailTo:thunder.chang@gmail.com)
robots.txt? NO

YebolBot (Email: yebolbot@gmail.com; If the web crawling affects your web service, or you don't like to be crawled by us, please email us. We'll stop crawling immediately.)
[Whattaya think robots.txt is for, huh?]
robots.txt? YES ... Four times in 45 minutes

Attributor/Dejan-1.0-dev (Test crawler; [attributor.com;...] info at attributor com)
robots.txt? NO

PRCrawler/Nutch-0.9 (data mining development project)
robots.txt? YES

EnaBot/1.2 (http://www.enaball.com/crawler.html)
robots.txt? YES

Nokia6680/1.0 ((4.04.07) SymbianOS/8.0 Series60/2.6 Profile/MIDP-2.0 Configuration/CLDC-1.1 (botmobi find.mobi/bot.html) )
[Note spaced-out closing parens]
robots.txt? YES

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; T312461) Java/1.5.0_09
robots.txt? NO

TheRarestParser/0.2a (http://therarestwords.com/)
robots.txt? NO

Mozilla/5.0 (compatible; D1GArabicEngine/1.0; crawlmaster@d1g.com)
robots.txt? NO

Clustera Crawler/Nutch-1.0-dev (Clustera Crawler; [crawler.clustera.com;...] cluster@clustera.com)
robots.txt? YES

Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.7) Gecko/20060909 Firefox/1.5.0.7
robots.txt? YES

yacybot (i386 Linux 2.6.16-xenU; java 1.6.0_02; America/en) [yacy.net...]
robots.txt? NO

Mozilla/5.0
robots.txt? NO

Spock Crawler (http://www.spock.com/crawler)
robots.txt? YES

TinEye
robots.txt? NO

Teemer (NetSeer, Inc. is a Los Angeles based Internet startup company.; [netseer.com...] crawler@netseer.com)
robots.txt? YES

nnn/ttt (n)
robots.txt? YES

AideRSS/1.0 (aiderss.com)
robots.txt? NO

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)
robots.txt? NO

----- ----- ----- ----- -----
These two UAs alternated multiple times one afternoon:

Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)
robots.txt? NO

WebClient
robots.txt? YES

----- ----- ----- ----- -----
And finally, way too many offerings from "Paul," who's apparently unable to make up his mind, UA name-wise:

Mozilla/5.0 (compatible; page-store) [email:paul at page-store.com
robots.txt? NO

Mozilla/5.0 (compatible; heritrix/1.12.1 +http://www.page-store.com)
robots.txt? YES

Mozilla/5.0 (compatible; heritrix/1.12.1 +http://www.page-store.com) [email:paul@page-store.com]
robots.txt? YES

Mozilla/5.0 (compatible; zermelo; +http://www.powerset.com) [email:paul@page-store.com,crawl@powerset.com]
robots.txt? NO

zermelo Mozilla/5.0 compatible; heritrix/1.12.1 (+http://www.powerset.com) [email:crawl@powerset.com,email:paul@page-store.com]
robots.txt? YES

zermelo Mozilla/5.0 compatible; heritrix/1.12.1 (+http://www.powerset.com) [email:crawl@powerset.com,email:paul@page-store.com]
robots.txt? YES

Mozilla/5.0 (compatible; zermelo; +http://www.powerset.com) [email:paul@page-store.com,crawl@powerset.com]
robots.txt? NO

-----
Slippery little suckers indeed. Thank goodness I block amazonaws.com no matter what.

 

blend27

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 3828718 posted 11:54 pm on Feb 18, 2010 (gmt 0)

ec2-184-73-16-198.compute-1.amazonaws.com
Nutch/Nutch-1.0-dev+(A+Nutch-based+crawler.;+http://lucene.apache.org/nutch/bot.html;+nutch-agent+AT+lucene.apache.org)

robots.txt? Yes - ignored it.

Went after Homepage and left with a fat 403.

dstiles

WebmasterWorld Senior Member Top Contributor of All Time 5+ Year Member



 
Msg#: 3828718 posted 11:54 pm on Feb 25, 2010 (gmt 0)

From the ddanchev security blog re: spamvertised casino games serving up viruses:

"What's particularly interesting about the campaign, is the fact that all of the domains serve identical template, with the (virusname) binary hosted 'in the cloud' thanks to Amazon's Web Services."

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 3828718 posted 5:40 am on Feb 28, 2010 (gmt 0)

ec2-174-129-65-46.compute-1.amazonaws.com
Mozilla/5.0 [Internet Explorer]

robots.txt? NO

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 3828718 posted 4:33 am on Mar 2, 2010 (gmt 0)

ec2-67-202-4-244.compute-1.amazonaws.com
Java/1.6.0_16

robots.txt? NO

Made a beeline for two dynamically generated files. HEAD requests.

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 3828718 posted 5:10 am on Mar 14, 2010 (gmt 0)

ec2-204-236-232-79.compute-1.amazonaws.com
Mozilla/5.0 (compatible; Feedtrace-bot/0.2; bot@feedtrace.com)

03/13 17:58:05 /robots.txt

ec2-184-73-40-108.compute-1.amazonaws.com
Mozilla/5.0 (compatible; Feedtrace-bot/0.2; bot@feedtrace.com)

03/13 17:58:06 /robots.txt

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 3828718 posted 3:18 am on Mar 20, 2010 (gmt 0)

ec2-75-101-172-69.compute-1.amazonaws.com
Ruby EventMachine

robots.txt? NO

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 3828718 posted 2:36 pm on Mar 22, 2010 (gmt 0)

ec2-67-202-36-161.compute-1.amazonaws.com
JS-Kit URL Resolver, http://js-kit.com/

robots.txt? NO

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 3828718 posted 7:45 pm on Apr 8, 2010 (gmt 0)

ec2-174-129-248-30.compute-1.amazonaws.com
Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.4) Gecko/2008102920 http://ow.ly web crawler (.NET CLR 3.5.30729)

robots.txt? NO

keyplyr

WebmasterWorld Senior Member Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



 
Msg#: 3828718 posted 5:45 pm on Apr 10, 2010 (gmt 0)


ec2-75-101-149-212.compute-1.amazonaws.com
TrueKnowledgeBot (http://www.trueknowledge.com/tkbot/; tkbot -AT- trueknowledge _dot_ com)

robots.txt: yes

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 3828718 posted 4:39 am on Apr 20, 2010 (gmt 0)

Twitter-swarmer:

ec2-204-236-201-131.compute-1.amazonaws.com
justsignal/1.0 (+http://justsignal.com)

robots.txt? NO
Fake ref/Log spam? YES, the UA's Host

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 3828718 posted 6:54 am on Apr 20, 2010 (gmt 0)

Twitter-swarming 'web miner':

ec2-174-129-76-46.compute-1.amazonaws.com
Mozilla/5.0 (compatible; ptd-crawler; +http://bixolabs.com/crawler/ptd/; crawler@bixolabs.com)

robots.txt? Yes

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 3828718 posted 11:01 pm on Apr 24, 2010 (gmt 0)

ec2-204-236-236-136.compute-1.amazonaws.com
Mozilla/5.0 (compatible; Firefox Addon; Windows XP 5.1)

robots.txt? NO

Four hits to / in five seconds (403s ignored).

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 3828718 posted 9:33 pm on Apr 30, 2010 (gmt 0)

ec2-72-44-62-4.compute-1.amazonaws.com
Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3

robots.txt? Yes

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 3828718 posted 5:31 pm on May 11, 2010 (gmt 0)

ec2-174-129-75-189.compute-1.amazonaws.com
research-scan-bot/Nutch-1.0

robots.txt? Yes

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 3828718 posted 7:21 pm on May 11, 2010 (gmt 0)

Again from AWS, last seen as "Acquia Crawler" in Aug., 2009 here [webmasterworld.com]

Now:

ec2-184-73-19-148.compute-1.amazonaws.com
acquia-crawler

robots.txt? NO

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 3828718 posted 3:08 pm on May 16, 2010 (gmt 0)

ec2-184-73-120-39.compute-1.amazonaws.com
Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Win64; x64; Trident/4.0)

robots.txt? NO

rb2k



 
Msg#: 3828718 posted 10:51 pm on May 31, 2010 (gmt 0)

Again from AWS, last seen as "Acquia Crawler" in Aug., 2009 here [webmasterworld.com]

Now:

ec2-184-73-19-148.compute-1.amazonaws.com
acquia-crawler

robots.txt? NO


Hey,
I'm the developer of the current acquia crawler. It actually has no shared code with the "old" one.
I'm sorry that it didn't check the robots.txt first. It SHOULD check the robots.txt first (and in general only look at the front page ("/") of the website).
It would be nice if you could PM me the URL of your site so I can see what might have caused this.

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 3828718 posted 5:17 pm on Jun 21, 2010 (gmt 0)

Real Yahoo doubtful, double-cloak, or double-cross? You be the judge...

ec2-184-73-3-35.compute-1.amazonaws.com
Mozilla/5.0 [en] (X11; U; Linux 2.2.15 i686 +http://www.yahoo.com/index.html)

robots.txt? Yes, but promptly ignored.

06/21 02:30:13 /robots.txt
06/21 02:30:17 /
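
A "fetched, then ignored" pattern like the one in this log can be checked mechanically. The following is only a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt rules shown (disallow everything) and the function name `violations` are assumptions for illustration, since the thread does not show the site's actual robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Assumed rules for illustration only: a site that disallows everything.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def violations(user_agent, requested_paths, robots_txt=ROBOTS_TXT):
    """Return the requested paths that robots.txt forbids for this UA."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [
        path
        for path in requested_paths
        # Fetching robots.txt itself is never counted as a violation.
        if path != "/robots.txt" and not parser.can_fetch(user_agent, path)
    ]


# The two requests from the log excerpt: robots.txt first, then the home page.
hits = ["/robots.txt", "/"]
ua = "Mozilla/5.0 [en] (X11; U; Linux 2.2.15 i686 +http://www.yahoo.com/index.html)"
print(violations(ua, hits))
```

With real rules in place of the assumed ones, any path returned by `violations` is a request the bot made despite having been told not to.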

tangor

WebmasterWorld Senior Member Top Contributor of All Time 5+ Year Member Top Contributors Of The Month



 
Msg#: 3828718 posted 9:31 am on Jun 22, 2010 (gmt 0)

Pfui... you are our resident aws expert. Have you considered updating your original post with new info? 168 messages to wade through... well, yeah, I suppose that's why we're here... the interaction. Yet, the issue is getting out of hand! As Boris Karloff used to say: BWAHAHAHAHA!

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 3828718 posted 3:32 am on Jul 10, 2010 (gmt 0)

tangor, long story short:

Block/deny/rewrite .amazonaws.com by name

: )

And now, from the Department of Tweet ReReReRedundancy Department:

ec2-184-73-67-107.compute-1.amazonaws.com
UA: NONE
robots.txt? NO

Req: HEAD
Fake Ref: @hourlypress

(Tweets upon retweets upon reretweets remind me more and more of e-mailed chain letter plagues.)
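
For readers who want to try the block-by-name approach, here is a minimal sketch in Python. It is not taken from any poster's setup: the helper names `is_amazonaws_host` and `ec2_hostname_to_ip` are made up for this example, and it assumes the hostname has already been obtained from the server's reverse-DNS lookup or from the access log.

```python
import re

# Hypothetical helper names; the hostname is assumed to come from the
# server's own reverse-DNS lookup or from the access log.
BLOCKED_SUFFIX = ".amazonaws.com"

# Matches EC2-style names such as ec2-184-73-67-107.compute-1.amazonaws.com
EC2_NAME = re.compile(r"^ec2-(\d{1,3})-(\d{1,3})-(\d{1,3})-(\d{1,3})\.")


def is_amazonaws_host(hostname):
    """Return True if the hostname is amazonaws.com or a subdomain of it."""
    name = hostname.strip().rstrip(".").lower()
    return name == BLOCKED_SUFFIX[1:] or name.endswith(BLOCKED_SUFFIX)


def ec2_hostname_to_ip(hostname):
    """Recover the dotted IP embedded in an ec2-a-b-c-d.* name, else None."""
    match = EC2_NAME.match(hostname.strip().lower())
    if not match:
        return None
    octets = match.groups()
    if any(int(octet) > 255 for octet in octets):
        return None
    return ".".join(octets)


host = "ec2-184-73-67-107.compute-1.amazonaws.com"
if is_amazonaws_host(host):
    print("deny", host, ec2_hostname_to_ip(host))
```

Matching on the leading dot of the suffix keeps look-alike names such as `notamazonaws.com` from being caught by accident.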

trader

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 3828718 posted 8:49 am on Jul 10, 2010 (gmt 0)

I must be missing something because I still don't understand why amazonaws.com is visiting so many sites and so often. Can someone please explain it?

Amazonaws.com is in almost all of my sites' referral logs with lots of visits day after day, week after week, and month after month.

Why in the world does Amazon want to visit my sites in the first place? For what purpose? How do they know about my URLs, especially the new sites? Is it somehow connected to the amazon.com affiliate program (however, the many sites they visit are not using the affiliate program and never did)?

Sometimes they visit a brand new site I just put online even before anyone else does or the site gets indexed anywhere.

How would Amazon know that my url was just put online? Sometimes they are my #1 traffic source! How and why is this happening? Please excuse my lack of knowledge on this subject.

keyplyr

WebmasterWorld Senior Member Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



 
Msg#: 3828718 posted 6:21 pm on Jul 10, 2010 (gmt 0)

@ trader

This thread is speaking of amazonaws.com, a virtual computing service offered by Amazon. The actual owners of these agents that come to your web sites are customers of Amazon.

I suggest reading this entire thread as well as doing some searches to gain a better understanding of what happens at a cloud service. There are others besides amazonaws.com.

trader

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 3828718 posted 7:45 pm on Jul 10, 2010 (gmt 0)

Thanks. I did read or scan most of the thread (but often too technical for me) but still do not understand why they are visiting my sites so much and what they have to gain by all the hits? Why is it worth the effort?

Also confused about how they even know about my websites in the first place (an issue I do not see answered), especially my brand new site URLs, just developed and not indexed anywhere yet. Sometimes I see amazonaws.com in my log files as one of the first visitors to a new site put online just the day before!

Sorry I am not too educated on this or about cloud service which I never heard of before you mentioned it. P.S. I am not a programmer or a geek (obviously).

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 3828718 posted 9:04 pm on Jul 10, 2010 (gmt 0)

In a nutshell: "amazonaws.com" is an alias used by a multitude of anonymous AWS customers.

Amazon, like others, offers 'cloud' computing services where companies/people rent virtual server space. This thread is testament to the stunning number of bad robots/crawlers/crap apps hiding in the amazonaws.com cloud.

And those "amazonaws.com" hits you're seeing? Someone's using your resources for secret reasons.

dstiles

WebmasterWorld Senior Member Top Contributor of All Time 5+ Year Member



 
Msg#: 3828718 posted 10:07 pm on Jul 10, 2010 (gmt 0)

To add to the list of bots from aws:

NetcraftSurvey, previously running from UK BT block 194.72.238.0/24, is now on AWS with a new bot UA. At least, I think it's the same service: the new one includes netcraft.com in the UA and the old one had a referer of netcraft.com, so they may serve different functions.

New UA: Mozilla/5.0 (compatible; NetcraftSurveyAgent/1.0; +info@netcraft.com)

It's possible the old one is still running. First time I saw this was today.

trader

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 3828718 posted 3:29 pm on Jul 11, 2010 (gmt 0)

Thanks Pfui, but what in the world could some of the "secret reasons" possibly be? I am guessing sending spam is one of them?

We are aware of lots of spam being sent with disguised headers, claiming to be from me and from an email address configured on the server (a far-from-obvious, unique name that could not be guessed).

However, those email addresses are not shown or listed anywhere on the web. They were only created when the site was set up on the server, because the setup process required an email address (though we never used it or intended to use it).

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 3828718 posted 4:39 pm on Jul 11, 2010 (gmt 0)

Probably not quite so nefarious. I reckon secret reasons are more likely trade secret-related -- companies/people data-mining to build or improve their hoped-for better mousetraps.

I get why they hide. But when they use/abuse my stuff, I want WHOIS, if not Who, What, When, Where, Why. Because, heck, if their aims aren't nefarious, why not?

(A bit OT: Spam-wise, faked addresses can be from anybody anywhere. But with yours that private to begin with... I dunno. Perhaps your server was more vulnerable at some point than it is now?)

dstiles

WebmasterWorld Senior Member Top Contributor of All Time 5+ Year Member



 
Msg#: 3828718 posted 8:49 pm on Jul 11, 2010 (gmt 0)

My experience of AWS is that it hosts a myriad of site scrapers or trivial bots (eg dozens of twitter/facebook me-too bots) that have no obvious use and no definite IP to block if we don't like them. My logs are full of the things, all blocked. Many of them pretend to be browsers, although certain characteristics belie that.

If a bot can't identify itself properly and supply a fixed address it's toast! I don't care who or what it is.

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 3828718 posted 1:04 am on Jul 13, 2010 (gmt 0)

A new addition to the post-Tweet swarm:

ec2-174-129-180-142.compute-1.amazonaws.com
my6sense/1.0

robots.txt? NO

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 3828718 posted 1:36 am on Jul 13, 2010 (gmt 0)

What a waste of log space. All post-Tweet HEAD hits to the exact same file. (403s obviously ignored.) No robots.txt 'natch -- or ever, re most UAs from amazonaws.

ec2-204-236-206-79.compute-1.amazonaws.com
PostRank/2.0 (postrank.com)
07/12 17:45:24
07/12 17:45:34
07/12 17:45:35
07/12 17:45:36

ec2-204-236-254-109.compute-1.amazonaws.com
PostRank/2.0 (postrank.com)
07/12 17:48:01
07/12 17:48:07
07/12 17:48:08
07/12 17:48:09
07/12 17:54:41
07/12 17:54:58
07/12 17:55:16
07/12 17:55:39
07/12 17:56:28
07/12 17:57:07
07/12 17:57:48
07/12 17:58:35
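
Bursts like these are easy to tally from a log. The sketch below is one hedged way to do it in Python, using the timestamps from the excerpt in this post. The year (taken from the post date), the 15-second window, and the function names `hits_per_host` and `max_burst` are all assumptions for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# The log excerpt omits the year; 2010 is assumed from the post date.
ASSUMED_YEAR = 2010

HOST_A = "ec2-204-236-206-79.compute-1.amazonaws.com"
HOST_B = "ec2-204-236-254-109.compute-1.amazonaws.com"

# (host, "MM/DD HH:MM:SS") pairs copied from the excerpt.
LOG = [
    (HOST_A, "07/12 17:45:24"), (HOST_A, "07/12 17:45:34"),
    (HOST_A, "07/12 17:45:35"), (HOST_A, "07/12 17:45:36"),
    (HOST_B, "07/12 17:48:01"), (HOST_B, "07/12 17:48:07"),
    (HOST_B, "07/12 17:48:08"), (HOST_B, "07/12 17:48:09"),
    (HOST_B, "07/12 17:54:41"), (HOST_B, "07/12 17:54:58"),
    (HOST_B, "07/12 17:55:16"), (HOST_B, "07/12 17:55:39"),
    (HOST_B, "07/12 17:56:28"), (HOST_B, "07/12 17:57:07"),
    (HOST_B, "07/12 17:57:48"), (HOST_B, "07/12 17:58:35"),
]


def hits_per_host(log):
    """Group parsed timestamps by host, sorted oldest first."""
    grouped = defaultdict(list)
    for host, stamp in log:
        when = datetime.strptime(f"{ASSUMED_YEAR}/{stamp}", "%Y/%m/%d %H:%M:%S")
        grouped[host].append(when)
    return {host: sorted(times) for host, times in grouped.items()}


def max_burst(times, window=timedelta(seconds=15)):
    """Largest number of hits that fall inside any sliding window."""
    best = 0
    start = 0
    for end in range(len(times)):
        # Advance the window start until the span fits inside the window.
        while times[end] - times[start] > window:
            start += 1
        best = max(best, end - start + 1)
    return best


for host, times in hits_per_host(LOG).items():
    print(host, "hits:", len(times), "max burst in 15s:", max_burst(times))
```

A threshold on `max_burst` (or on total hits per host) could then feed whatever blocking mechanism is already in place.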

incrediBILL

WebmasterWorld Administrator Top Contributor of All Time 5+ Year Member Top Contributors Of The Month



 
Msg#: 3828718 posted 5:20 pm on Jul 16, 2010 (gmt 0)

I get why they hide.


I don't think most are hiding.

AWS cloud services just happen to be cheap and fast, something a startup company with limited resources would find appealing.


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved