Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

This 278 message thread spans 10 pages; this is page 5.
amazonaws.com plays host to wide variety of bad bots
Most recently seen: Gnomit
Pfui




msg:3828720
 3:04 am on Jan 18, 2009 (gmt 0)

ec2-67-202-57-30.compute-1.amazonaws.com
Mozilla/5.0 (compatible; X11; U; Linux i686 (x86_64); en-US; +http://gnomit.com/) Gecko/2008092416 Gnomit/1.0"

- robots.txt? NO
- Uneven apostrophes in UA (only closing)
- site in UA yields this oh-so-descriptive info:

<html>
<head>
</head>
<body>
</body>
</html>
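
The UA oddities noted above (a lone closing quote here, stray parentheses in other entries) can be checked mechanically. A minimal sketch, assuming Python 3.9+; the function name `ua_quirks` and the two checks are illustrative, not anything the bot-runners or this forum provide:

```python
def ua_quirks(ua: str) -> list[str]:
    """Return a list of formatting oddities found in a User-Agent string."""
    quirks = []
    # An odd number of double quotes means one has no partner.
    if ua.count('"') % 2 == 1:
        quirks.append("unbalanced double quote")
    # Track parenthesis depth; it must never go negative and must end at zero.
    depth = 0
    for ch in ua:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                break
    if depth != 0:
        quirks.append("unbalanced parentheses")
    return quirks


gnomit = ('Mozilla/5.0 (compatible; X11; U; Linux i686 (x86_64); en-US; '
          '+http://gnomit.com/) Gecko/2008092416 Gnomit/1.0"')
print(ua_quirks(gnomit))  # ['unbalanced double quote']
```

A clean result proves nothing about a bot's manners, of course; this only flags sloppy UA strings.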

----- ----- ----- ----- -----
FWIW, bona fide amazonaws.com hosts spewed at least 33 bots on two of my sites in recent months. (Does someone get paid per bot or something?) Some bots may be new to some of you; or newly renamed. Here are the actual UA strings; in no particular order:

NetSeer/Nutch-0.9 (NetSeer Crawler; [netseer.com;...] crawler@netseer.com)
robots.txt? YES

Mozilla/5.0 (Windows; U; Windows NT 5.1; ru; rv:1.8.0.6) Gecko/20060728 Firefox/1.5.0.6
[Note ru.]
robots.txt? NO

feedfinder/1.371 Python-urllib/1.16 +http://www.aaronsw.com/2002/feedfinder/
robots.txt? NO

Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9b4pre) Gecko/2008022910 Viewzi/0.1
robots.txt? NO

Twitturly / v0.5
robots.txt? NO

YebolBot (compatible; Mozilla/5.0; MSIE 7.0; Windows NT 6.0; rv:1.8.1.11; mailTo:thunder.chang@gmail.com)
robots.txt? NO

YebolBot (Email: yebolbot@gmail.com; If the web crawling affects your web service, or you don't like to be crawled by us, please email us. We'll stop crawling immediately.)
[Whattaya think robots.txt is for, huh?]
robots.txt? YES ... Four times in 45 minutes

Attributor/Dejan-1.0-dev (Test crawler; [attributor.com;...] info at attributor com)
robots.txt? NO

PRCrawler/Nutch-0.9 (data mining development project)
robots.txt? YES

EnaBot/1.2 (http://www.enaball.com/crawler.html)
robots.txt? YES

Nokia6680/1.0 ((4.04.07) SymbianOS/8.0 Series60/2.6 Profile/MIDP-2.0 Configuration/CLDC-1.1 (botmobi find.mobi/bot.html) )
[Note spaced-out closing parens]
robots.txt? YES

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; T312461) Java/1.5.0_09
robots.txt? NO

TheRarestParser/0.2a (http://therarestwords.com/)
robots.txt? NO

Mozilla/5.0 (compatible; D1GArabicEngine/1.0; crawlmaster@d1g.com)
robots.txt? NO

Clustera Crawler/Nutch-1.0-dev (Clustera Crawler; [crawler.clustera.com;...] cluster@clustera.com)
robots.txt? YES

Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.7) Gecko/20060909 Firefox/1.5.0.7
robots.txt? YES

yacybot (i386 Linux 2.6.16-xenU; java 1.6.0_02; America/en) [yacy.net...]
robots.txt? NO

Mozilla/5.0
robots.txt? NO

Spock Crawler (http://www.spock.com/crawler)
robots.txt? YES

TinEye
robots.txt? NO

Teemer (NetSeer, Inc. is a Los Angeles based Internet startup company.; [netseer.com...] crawler@netseer.com)
robots.txt? YES

nnn/ttt (n)
robots.txt? YES

AideRSS/1.0 (aiderss.com)
robots.txt? NO

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)
robots.txt? NO

----- ----- ----- ----- -----
These two UAs alternated multiple times one afternoon:

Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)
robots.txt? NO

WebClient
robots.txt? YES

----- ----- ----- ----- -----
And finally, way too many offerings from "Paul," who's apparently unable to make up his mind, UA name-wise:

Mozilla/5.0 (compatible; page-store) [email:paul at page-store.com
robots.txt? NO

Mozilla/5.0 (compatible; heritrix/1.12.1 +http://www.page-store.com)
robots.txt? YES

Mozilla/5.0 (compatible; heritrix/1.12.1 +http://www.page-store.com) [email:paul@page-store.com]
robots.txt? YES

Mozilla/5.0 (compatible; zermelo; +http://www.powerset.com) [email:paul@page-store.com,crawl@powerset.com]
robots.txt? NO

zermelo Mozilla/5.0 compatible; heritrix/1.12.1 (+http://www.powerset.com) [email:crawl@powerset.com,email:paul@page-store.com]
robots.txt? YES

zermelo Mozilla/5.0 compatible; heritrix/1.12.1 (+http://www.powerset.com) [email:crawl@powerset.com,email:paul@page-store.com]
robots.txt? YES

Mozilla/5.0 (compatible; zermelo; +http://www.powerset.com) [email:paul@page-store.com,crawl@powerset.com]
robots.txt? NO

-----
Slippery little suckers indeed. Thank goodness I block amazonaws.com no matter what.

 

Pfui




msg:4061855
 4:08 am on Jan 16, 2010 (gmt 0)

Long story short? Block .amazonaws.com :)

IP range-wise, the IPs are 'in' the Host names.* Now as to how many there are, let alone what they are, I'm sorry but I'll have to leave that compilation as a sweat equity exercise for the bot-curious/obsessed at this time. Suffice it to say that akin to any country -- and numbering more than many countries'! -- Amazon's cloud-related IPs are neither contiguous nor non-expanding.

.
*The second post in the MetaURI [webmasterworld.com] thread shows more detail, including an atypical example of the same UA using the exact same AWS IP over a period of time.
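
Since the IPs are 'in' the host names, they can be pulled back out with a few lines of code. A minimal sketch, assuming host names shaped like the `ec2-N-N-N-N.compute-1.amazonaws.com` examples in this thread; the pattern and the name `ip_from_ec2_host` are assumptions, and other AWS host-name formats are not handled:

```python
import ipaddress
import re
from typing import Optional

# Matches names such as ec2-67-202-57-30.compute-1.amazonaws.com
EC2_HOST = re.compile(
    r"^ec2-(\d{1,3})-(\d{1,3})-(\d{1,3})-(\d{1,3})\.[a-z0-9.-]+\.amazonaws\.com$"
)


def ip_from_ec2_host(host: str) -> Optional[ipaddress.IPv4Address]:
    """Return the IPv4 address embedded in an EC2-style host name, or None."""
    m = EC2_HOST.match(host.strip().lower())
    if m is None:
        return None
    try:
        return ipaddress.IPv4Address(".".join(m.groups()))
    except ipaddress.AddressValueError:
        return None  # e.g. an "octet" above 255


print(ip_from_ec2_host("ec2-67-202-57-30.compute-1.amazonaws.com"))  # 67.202.57.30
```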

Pfui




msg:4062180
 11:17 pm on Jan 16, 2010 (gmt 0)

Emphasis mine. See link below for more info:

ec2-174-129-120-104.compute-1.amazonaws.com
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
(for IE)

robots.txt? NO
URI: /favicon.ico
Related: Yahoo's cloaked crawler(s) [webmasterworld.com]

Pfui




msg:4062445
 4:41 pm on Jan 17, 2010 (gmt 0)

ec2-174-129-237-42.compute-1.amazonaws.com
OpenCalaisSemanticProxy

robots.txt? NO

Pfui




msg:4062932
 4:55 pm on Jan 18, 2010 (gmt 0)

ec2-204-236-247-88.compute-1.amazonaws.com
@hourlypress

robots.txt? NO

Twitter-related.

Pfui




msg:4062992
 6:39 pm on Jan 18, 2010 (gmt 0)

Two hits 20 seconds apart; UA not too cleverly cloaked:

ec2-75-101-147-15.compute-1.amazonaws.com
Firefox

robots.txt? NO

trader




msg:4064036
 6:02 am on Jan 20, 2010 (gmt 0)

Just found this long thread. Don't understand why amazonaws.com is visiting so many sites and so often. Can someone please explain it?

I see amazonaws.com in many of my sites' referral logs with lots of visits week after week and month after month.

Are they owned by amazon.com? Why do they want to visit my sites in the first place? How do they know about my URLs? Sometimes they visit even before anyone else does or the site has time to get listed in Google.

How would they even know my URL was just put online? Sometimes they are my #1 traffic source with both newer and older sites. How and why is this happening? Who are they?

keyplyr




msg:4064056
 7:02 am on Jan 20, 2010 (gmt 0)

> Just found this long thread. Don't understand why amazonaws.com is visiting so many sites and so often. Can someone please explain it?

Read the thread. It's explained.

Pfui




msg:4064604
 11:20 pm on Jan 20, 2010 (gmt 0)

Long thread short, the title tells it all:

amazonaws.com plays host to wide variety of bad bots

That statement was true exactly one year+three days ago when I started this thread. Now, 125-plus posts later, and scores and scores and scores of bad, rude, iffy, test, and/or worthless bot hits later, that statement's an understatement.

So if your main traffic source is amazonaws.com-related, I'm sorry but none of that resource-eating traffic -- NONE of it -- is real people visiting in real time.

Block amazonaws.com and you'll stop feeding its plague of locusts.

dstiles




msg:4065128
 5:14 pm on Jan 21, 2010 (gmt 0)

Not sure how this fits into the Amazon scenario:

IP: 216.113.169.nnn
Standard Headers: Accept All only
UA: Mozilla/5.0 (Windows; U; Windows NT 5.1; ja; rv:1.9.2a1pre) Gecko/20090402 Firefox/3.6a1pre (.NET CLR 3.5.30729)
Referer: http:// www. google. com
HTTP_CONNECTION: Keep-Alive
Robots: haven't checked

The hit (home page of one site only) was trapped on several points that usually indicate an exploit attempt.

eBay, Inc
OrgID: EBAY
NetRange: 216.113.160.0 - 216.113.191.255

Pfui




msg:4065242
 8:13 pm on Jan 21, 2010 (gmt 0)

Hmm. I may be missing something but I don't see a connection between your info and amazonaws.com or anything Amazon.

If you suspect a cloaked/rogue spider, you could start a new thread with more details about the visit. Alternatively, it could've just been a zombied machine belonging to an eBay employee.

dstiles




msg:4065286
 9:48 pm on Jan 21, 2010 (gmt 0)

Hmm. I'll just go and bury my head in a bucket, then, shall I?

Not being a user of either, I totally confused Amazon and Ebay. Sorry. :(

Pfui




msg:4065417
 2:15 am on Jan 22, 2010 (gmt 0)


Alas, because amazonaws.com is a never-ending source of new bad bots, I suspect this thread -- and my many bot-sightings in it -- will continue (...ad nauseam, sorry:)

bedlam




msg:4068421
 8:55 pm on Jan 26, 2010 (gmt 0)

Just a note about a bot identifying itself as PostRank. Somebody above mentioned that it made one HEAD request and then left.

However, I've just seen that bot (same AWS IP range) request /wp-admin/install.php. I can't see any good reason for a bot to want that...

-- b

dstiles




msg:4068538
 10:55 pm on Jan 26, 2010 (gmt 0)

I see a lot of php accesses through folders such as admin. All, as far as I've seen, are attempts to gain access to the server through faults (eg no password) in admin whatever files, often for phpmyadmin or other control panels.

As I noted elsewhere (I think in this thread): I have seen AWS used either directly (via a logged-on account) or indirectly (via infected servers) for "standard botnet" exploit attempts.
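
For anyone who wants to flag such requests in their own logs, a rough sketch follows. The folder names in the pattern (`wp-admin`, `phpmyadmin`, `admin`) are taken from the posts above; the function name `looks_like_admin_probe` and the matching rule itself are assumptions, and a real site with its own admin area would need to exempt legitimate users:

```python
import re

# Path segments mentioned in this thread as common probe targets.
PROBE = re.compile(r"(?:^|/)(?:wp-admin|phpmyadmin|admin)(?:/|$)", re.IGNORECASE)


def looks_like_admin_probe(request_path: str) -> bool:
    """True if the requested path contains one of the admin folder names."""
    path_only = request_path.split("?", 1)[0]  # ignore any query string
    return PROBE.search(path_only) is not None


print(looks_like_admin_probe("/wp-admin/install.php"))  # True
print(looks_like_admin_probe("/favicon.ico"))           # False
```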

tangor




msg:4068657
 2:06 am on Jan 27, 2010 (gmt 0)

The enormous number of "php" requests via the discussed IP range(s) is why I've been intrigued, and why I finally nuked it a week or so back. I could have approached it the other way around via referer and accomplished the same thing, but life is too short and there are too many rogue bots on that service.

Pfui




msg:4070839
 2:25 am on Jan 30, 2010 (gmt 0)

ec2-75-101-243-212.compute-1.amazonaws.com
curl/7.18.2 (i486-pc-linux-gnu) libcurl/7.18.2 OpenSSL/0.9.8g zlib/1.2.3.3 libidn/1.8 libssh2/0.18

robots.txt? NO

Pfui




msg:4073031
 9:09 pm on Feb 2, 2010 (gmt 0)

ec2-204-236-242-36.compute-1.amazonaws.com
Mozilla/5.0 (compatible; spbot/1.0; +http://www.seoprofiler.com/bot/ )

robots.txt? Yes

[Above info also posted as new thread.]

Curiously, per WHOIS, the bot-runner's site appears to be hosted by Amazonaws.com:

seoprofiler.com => 174.129.8.145 => Amazon.com/amazonaws.com
[Amazon Web Services, Elastic Compute Cloud, EC2]

Interesting how a company can claim a dynamically assigned IP as its permanent address...

thetrasher




msg:4075879
 7:13 pm on Feb 7, 2010 (gmt 0)

New: AMAZON-EC2-7 = 184.72.0.0/15

dstiles




msg:4075914
 8:21 pm on Feb 7, 2010 (gmt 0)

Well spotted! My very first ban of the 184 range. :)

tpeacock




msg:4078220
 10:57 am on Feb 11, 2010 (gmt 0)

These are the CIDR blocks I have for amazonaws's bad bots at this point. Does this seem to cover all of them or are there more?

67.202.0.0/18 # "do not delete" - amazonaws.com's bad bots
75.101.128.0/17 # "do not delete" - amazonaws.com's bad bots
79.125.0.0/18 # "do not delete" - amazonaws.com's bad bots - Ireland
174.129.0.0/16 # "do not delete" - amazonaws.com's bad bots
184.72.0.0/15 # "do not delete" - amazonaws.com's bad bots
204.236.128.0/17 # "do not delete" - amazonaws.com's bad bots

Thomas

thetrasher




msg:4078296
 1:11 pm on Feb 11, 2010 (gmt 0)

+
216.182.224.0/20
72.44.32.0/19

Lain_se




msg:4078570
 7:06 pm on Feb 11, 2010 (gmt 0)

My laundry list of ec2 trash includes the following.

deny from 67.202.0.0/18 "Amazon ec2-Cloud"
deny from 72.44.32.0/19 "Amazon ec2-Cloud"
deny from 75.101.128.0/17 "Amazon ec2-Cloud"
deny from 79.125.0.0/18 "Amazon ec2-Cloud"
deny from 174.129.0.0/16 "Amazon ec2-Cloud"
deny from 184.72.0.0/15 "Amazon ec2-Cloud"
deny from 204.74.108.0/24 "Amazon ec2-Cloud"
deny from 204.236.128.0/17 "Amazon ec2-Cloud"
deny from 204.74.108.0/24 "Amazon ec2-Cloud"

I have seen virtually every user agent, real and fake, and NEVER once have I been able to figure out why I should allow them to index my site. I spoke with Amazon Services about a year ago, explained why I am blocking them, and asked that they force a user-ID tag of some sort to identify the user for abuse reasons. They told me that this will never happen... so I told them that's too bad, and that their services will always be blocked as well. I honestly do not think they care, and they do not seem to understand that they are only enabling web scrapers, spammers and other trash to ruin their integrity. Again, I do not think they care so long as they are getting paid.

keyplyr




msg:4078601
 8:04 pm on Feb 11, 2010 (gmt 0)

CLOUD: Creepy Litigious Outrageous User-agent Dwelling

dstiles




msg:4078615
 8:30 pm on Feb 11, 2010 (gmt 0)

I've got the Ireland one 79.125.0.0/127

204.74.108.0/24 (which you list twice) resolves here to:
UltraDNS Corp ULTRADNS-GLOBAL-2
204.74.96.0 - 204.74.108.255
108 seems to be mostly unused apart from loads of name servers on 1.
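
Both points (the entry listed twice, and that /24 sitting inside the UltraDNS range) can be double-checked with Python's standard `ipaddress` module. A minimal sketch, assuming the deny list is pasted in exactly as posted above; the helper name `cidrs` is made up for illustration:

```python
import ipaddress
from collections import Counter

deny_lines = """\
deny from 67.202.0.0/18 "Amazon ec2-Cloud"
deny from 72.44.32.0/19 "Amazon ec2-Cloud"
deny from 75.101.128.0/17 "Amazon ec2-Cloud"
deny from 79.125.0.0/18 "Amazon ec2-Cloud"
deny from 174.129.0.0/16 "Amazon ec2-Cloud"
deny from 184.72.0.0/15 "Amazon ec2-Cloud"
deny from 204.74.108.0/24 "Amazon ec2-Cloud"
deny from 204.236.128.0/17 "Amazon ec2-Cloud"
deny from 204.74.108.0/24 "Amazon ec2-Cloud"
"""


def cidrs(text):
    """Yield the network from each 'deny from <cidr> ...' line."""
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 3 and [p.lower() for p in parts[:2]] == ["deny", "from"]:
            yield ipaddress.ip_network(parts[2])


counts = Counter(cidrs(deny_lines))
duplicates = [str(net) for net, n in counts.items() if n > 1]
print(duplicates)  # ['204.74.108.0/24']

# The WHOIS range quoted above, expressed as CIDR blocks.
ultradns = list(ipaddress.summarize_address_range(
    ipaddress.ip_address("204.74.96.0"),
    ipaddress.ip_address("204.74.108.255"),
))
print(ipaddress.ip_network("204.74.108.0/24") in ultradns)  # True
```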

tpeacock




msg:4080067
 4:44 pm on Feb 14, 2010 (gmt 0)

Thanks, thetrasher, for those 2. And dstiles, the Ireland one was 79.125.0.0/17, not /18 like I had. I assume you meant 17, not 127?

67.202.0.0/18 # "do not delete" - amazonaws.com's bad bots
72.44.32.0/19 # "do not delete" - amazonaws.com's bad bots
75.101.128.0/17 # "do not delete" - amazonaws.com's bad bots
79.125.0.0/17 # "do not delete" - amazonaws.com's bad bots - Ireland
174.129.0.0/16 # "do not delete" - amazonaws.com's bad bots
184.72.0.0/15 # "do not delete" - amazonaws.com's bad bots
204.236.128.0/17 # "do not delete" - amazonaws.com's bad bots
216.182.224.0/20 # "do not delete" - amazonaws.com's bad bots

That's a lot of IP addresses but I cannot think of a single reason not to block them all.
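
To put a number on "a lot", and to test individual visitors against the list, here is a minimal sketch using Python's standard `ipaddress` module. The eight blocks are the ones listed above; the names `BLOCKS` and `is_listed` are illustrative, and the list is only as current as this post:

```python
import ipaddress

BLOCKS = [ipaddress.ip_network(c) for c in (
    "67.202.0.0/18", "72.44.32.0/19", "75.101.128.0/17", "79.125.0.0/17",
    "174.129.0.0/16", "184.72.0.0/15", "204.236.128.0/17", "216.182.224.0/20",
)]


def is_listed(ip: str) -> bool:
    """True if the address falls inside any of the blocks above."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BLOCKS)


# Total addresses covered by the eight blocks.
print(sum(net.num_addresses for net in BLOCKS))  # 323584

print(is_listed("174.129.8.145"))  # True  (the seoprofiler.com address from earlier)
print(is_listed("216.113.169.1"))  # False (inside the eBay range mentioned earlier)
```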

Thomas

dstiles




msg:4080073
 5:27 pm on Feb 14, 2010 (gmt 0)

> I assumed you meant 17 not 127?

Sorry. It had been a long day. :)
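
For the record, a quick way to catch that sort of slip and to see what /17 versus /18 actually covers. A minimal sketch with Python's standard `ipaddress` module; it only shows the arithmetic and says nothing about which size Amazon really holds:

```python
import ipaddress

# An IPv4 prefix length cannot exceed 32, so /127 is rejected outright.
try:
    ipaddress.ip_network("79.125.0.0/127")
    slash_127_valid = True
except ValueError:
    slash_127_valid = False
print(slash_127_valid)  # False

n18 = ipaddress.ip_network("79.125.0.0/18")
n17 = ipaddress.ip_network("79.125.0.0/17")
print(n18[0], n18[-1])  # 79.125.0.0 79.125.63.255
print(n17[0], n17[-1])  # 79.125.0.0 79.125.127.255
```

So the /17 is twice the size of the /18 and fully contains it.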

Pfui




msg:4080665
 6:06 pm on Feb 15, 2010 (gmt 0)

ec2-75-101-245-135.compute-1.amazonaws.com
Twisted PageGetter

robots.txt? NO

See also: Twisted PageGetter [webmasterworld.com] (09/2009)

blend27




msg:4081004
 3:24 am on Feb 16, 2010 (gmt 0)

> CLOUD: Creepy Litigious Outrageous User-agent Dwelling


CLOUD: 403

Pfui




msg:4082662
 1:00 am on Feb 18, 2010 (gmt 0)

ec2-174-129-153-217.compute-1.amazonaws.com
HTMLParser/2.0

robots.txt? NO

Pfui




msg:4082671
 1:15 am on Feb 18, 2010 (gmt 0)

ec2-174-129-167-253.compute-1.amazonaws.com
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50728)

robots.txt? NO

What bothers me about this one is not just its non-bot UA, but that its hits only went to two html files and some of their graphics files, and knew just where to look before arriving. I've blocked amazonaws.com for over a year -- basically from the first day I saw it -- so the directory paths didn't come from my server. Hmm.

blend27




msg:4083220
 11:54 pm on Feb 18, 2010 (gmt 0)

ec2-184-73-16-198.compute-1.amazonaws.com
Nutch/Nutch-1.0-dev+(A+Nutch-based+crawler.;+http://lucene.apache.org/nutch/bot.html;+nutch-agent+AT+lucene.apache.org)

robots.txt? Yes - ignored it.

Went after Homepage and left with a fat 403.

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved