Forum Moderators: Ocean10000 & incrediBILL & keyplyr

amazonaws.com plays host to wide variety of bad bots

Most recently seen: Gnomit

     
3:04 am on Jan 18, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 5, 2005
posts:2038
votes: 1


ec2-67-202-57-30.compute-1.amazonaws.com
Mozilla/5.0 (compatible; X11; U; Linux i686 (x86_64); en-US; +http://gnomit.com/) Gecko/2008092416 Gnomit/1.0"

- robots.txt? NO
- Uneven apostrophes in UA (only closing)
- site in UA yields this oh-so-descriptive info:

<html>
<head>
</head>
<body>
</body>
</html>

----- ----- ----- ----- -----
FWIW, bona fide amazonaws.com hosts spewed at least 33 bots on two of my sites in recent months. (Does someone get paid per bot or something?) Some bots may be new to some of you; or newly renamed. Here are the actual UA strings; in no particular order:

NetSeer/Nutch-0.9 (NetSeer Crawler; [netseer.com;...] crawler@netseer.com)
robots.txt? YES

Mozilla/5.0 (Windows; U; Windows NT 5.1; ru; rv:1.8.0.6) Gecko/20060728 Firefox/1.5.0.6
[Note ru.]
robots.txt? NO

feedfinder/1.371 Python-urllib/1.16 +http://www.aaronsw.com/2002/feedfinder/
robots.txt? NO

Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9b4pre) Gecko/2008022910 Viewzi/0.1
robots.txt? NO

Twitturly / v0.5
robots.txt? NO

YebolBot (compatible; Mozilla/5.0; MSIE 7.0; Windows NT 6.0; rv:1.8.1.11; mailTo:thunder.chang@gmail.com)
robots.txt? NO

YebolBot (Email: yebolbot@gmail.com; If the web crawling affects your web service, or you don't like to be crawled by us, please email us. We'll stop crawling immediately.)
[Whattaya think robots.txt is for, huh?]
robots.txt? YES ... Four times in 45 minutes

Attributor/Dejan-1.0-dev (Test crawler; [attributor.com;...] info at attributor com)
robots.txt? NO

PRCrawler/Nutch-0.9 (data mining development project)
robots.txt? YES

EnaBot/1.2 (http://www.enaball.com/crawler.html)
robots.txt? YES

Nokia6680/1.0 ((4.04.07) SymbianOS/8.0 Series60/2.6 Profile/MIDP-2.0 Configuration/CLDC-1.1 (botmobi find.mobi/bot.html) )
[Note spaced-out closing parens]
robots.txt? YES

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; T312461) Java/1.5.0_09
robots.txt? NO

TheRarestParser/0.2a (http://therarestwords.com/)
robots.txt? NO

Mozilla/5.0 (compatible; D1GArabicEngine/1.0; crawlmaster@d1g.com)
robots.txt? NO

Clustera Crawler/Nutch-1.0-dev (Clustera Crawler; [crawler.clustera.com;...] cluster@clustera.com)
robots.txt? YES

Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.7) Gecko/20060909 Firefox/1.5.0.7
robots.txt? YES

yacybot (i386 Linux 2.6.16-xenU; java 1.6.0_02; America/en) [yacy.net...]
robots.txt? NO

Mozilla/5.0
robots.txt? NO

Spock Crawler (http://www.spock.com/crawler)
robots.txt? YES

TinEye
robots.txt? NO

Teemer (NetSeer, Inc. is a Los Angeles based Internet startup company.; [netseer.com...] crawler@netseer.com)
robots.txt? YES

nnn/ttt (n)
robots.txt? YES

AideRSS/1.0 (aiderss.com)
robots.txt? NO

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)
robots.txt? NO

----- ----- ----- ----- -----
These two UAs alternated multiple times one afternoon:

Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)
robots.txt? NO

WebClient
robots.txt? YES

----- ----- ----- ----- -----
And finally, way too many offerings from "Paul," who's apparently unable to make up his mind, UA name-wise:

Mozilla/5.0 (compatible; page-store) [email:paul at page-store.com
robots.txt? NO

Mozilla/5.0 (compatible; heritrix/1.12.1 +http://www.page-store.com)
robots.txt? YES

Mozilla/5.0 (compatible; heritrix/1.12.1 +http://www.page-store.com) [email:paul@page-store.com]
robots.txt? YES

Mozilla/5.0 (compatible; zermelo; +http://www.powerset.com) [email:paul@page-store.com,crawl@powerset.com]
robots.txt? NO

zermelo Mozilla/5.0 compatible; heritrix/1.12.1 (+http://www.powerset.com) [email:crawl@powerset.com,email:paul@page-store.com]
robots.txt? YES

zermelo Mozilla/5.0 compatible; heritrix/1.12.1 (+http://www.powerset.com) [email:crawl@powerset.com,email:paul@page-store.com]
robots.txt? YES

Mozilla/5.0 (compatible; zermelo; +http://www.powerset.com) [email:paul@page-store.com,crawl@powerset.com]
robots.txt? NO

-----
Slippery little suckers indeed. Thank goodness I block amazonaws.com no matter what.
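
For anyone scripting the host check, here is a minimal Python sketch. It assumes the reverse-DNS name has the same ec2-a-b-c-d.compute-1.amazonaws.com shape as the Gnomit example at the top of this post. The function names are hypothetical, and other AWS hostname shapes are deliberately not covered.

```python
import re
from typing import Optional

# Hostname shape taken from the example in this post; other AWS hostname
# patterns exist and are NOT covered by this sketch.
EC2_HOST = re.compile(
    r"^ec2-(\d{1,3})-(\d{1,3})-(\d{1,3})-(\d{1,3})\.compute-1\.amazonaws\.com$",
    re.IGNORECASE,
)


def is_amazonaws_host(hostname: str) -> bool:
    """True for amazonaws.com itself or any subdomain of it."""
    host = hostname.strip().rstrip(".").lower()
    return host == "amazonaws.com" or host.endswith(".amazonaws.com")


def embedded_ip(hostname: str) -> Optional[str]:
    """Return the dotted IP encoded in an ec2-a-b-c-d hostname, else None."""
    match = EC2_HOST.match(hostname.strip().rstrip("."))
    if not match:
        return None
    octets = match.groups()
    if any(int(octet) > 255 for octet in octets):
        return None
    return ".".join(octets)
```

A hostname taken from a log is only as trustworthy as the lookup behind it, so a real setup would probably also resolve the name forward and compare it with the connecting IP.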

11:55 pm on Jan 20, 2009 (gmt 0)

Administrator from US 

WebmasterWorld Administrator incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 25, 2005
posts:14662
votes: 95


At this point amazonaws.com is synonymous with bad activity so blocking them becomes the de facto standard.

The only concern is what happens when the "next big thing" uses their cloud computing facility and those of us blocking it because of all the bad bot noise totally miss out on being in the Google killer until it's too late?

1:16 am on Jan 21, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Mar 31, 2002
posts:25430
votes: 0


Pfui,

Very nice list!

But for clarity, what is the meaning of the "Robots.txt? Yes/No" entries above?

The user-agent fetches robots.txt
-or-
The user-agent fetches and obeys robots.txt

Thanks,
Jim

9:19 pm on Jan 29, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 5, 2005
posts: 2038
votes: 1


Sorry for the belated reply. (Clicking 'Email notification of new replies' hasn't worked for me for a long time.)

-----
Jim:

"robots.txt? YES" <== Asked for it
"robots.txt? NO" <== Didn't ask

To my recollection, most that asked for robots.txt honored it. I didn't really pay attention because that's the only file I let unauthorized bots -- and amazonaws.com, the host -- access.
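
The YES/NO tally above can be reproduced from an access log. This is a rough Python sketch, assuming Apache's "combined" log format; the function name is hypothetical. Like the list above, it only records whether a user-agent asked for robots.txt, not whether it obeyed it.

```python
import re
from collections import defaultdict

# Assumed layout (Apache "combined"):
# host ident user [time] "METHOD path PROTO" status bytes "referer" "user-agent"
LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[[^\]]*\] "(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"$'
)


def robots_report(lines):
    """Map each User-Agent to True if it ever requested /robots.txt."""
    asked = defaultdict(bool)
    for line in lines:
        match = LINE.match(line.strip())
        if not match:
            continue  # skip lines that do not fit the assumed format
        ua = match.group("ua")
        path = match.group("path").split("?", 1)[0]  # ignore any query string
        asked[ua] = asked[ua] or path == "/robots.txt"
    return dict(asked)
```

Feeding it only the lines whose host ends in amazonaws.com would give a per-UA table like the one in the first post.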

By the way, here's another one. If this were an eBay item, they'd get yanked for keyword-spamming the spider's brand!

Nutch/Nutch-1.0-dev (A Nutch-based crawler.; [lucene.apache.org...] nutch-agent AT lucene.apache.org)
robots.txt? YES

-----
Bill:

At this point, I don't know what I think about AWS, other than I'm amazed at (& irked by) all of its/their bots. (Plus their CloudFront info is so PC-centric that this Mac/Apache person's eyes just glaze over.) Why spider constantly, I wonder? Are they trying to rival Google? What are they doing with all of the data? Where are they going with all of their really geeky doodads?

Oh, to be a fly on the wall in Bezos's office, eh?

-----
All:

What are your thoughts on their Web Services? Are you intrigued? Have you tested anything and noticed more spidering? Their cookie systems are fantastically inbred -- notice anything new in that dept.?

What I do know is that between amazon.com and imdb.com and amazonfresh.com (fresh.amazon.com) and amazonaws.com (aws.amazon.com), they're all over the map. Literally. And their spiders are squished all over my server. Literally.

: )

[edited by: Pfui at 9:21 pm (utc) on Jan. 29, 2009]

10:23 pm on Jan 29, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member 5+ Year Member

joined:May 31, 2008
posts:661
votes: 0


it's ec, so they're charged by traffic and computing time, right?
let them pay. if you discover they're spidering your site and you can spare the traffic, just send big, complex files. they need some form of parser, and handling big documents will slow their crawlers down and cost 'em money. if they're using a regexp to retrieve urls to crawl, it won't be that funny, but if they're using a stack based parser and you send them a document with thousands of stacked nodes, it might get funny ;)

generally, I'm using amazon-webservices (the alexa information thingy) and the ecommerce-webservice, though I don't use either of them that heavily (my bill averages $3 / month). I usually don't mind bot traffic (traffic is free and my servers can easily handle the load right now) if the bot behaves. If it doesn't, I don't care if it's an official bot by amazon or just one that's hosted on their platform, it gets blocked, most likely in iptables, so I don't have all those 403-errors in my logs.

8:22 pm on Feb 6, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 5, 2005
posts: 2038
votes: 1


Today's new ec2-[yada-yada].compute-1.amazonaws.com bot, and/or its botrunner, was more rude than most. No robots.txt, then hit 11 times in 5 visits over 4 hours (thus far). Perhaps not so coincidentally, Every. Single. Hit. was to an ALL-robot restricted page, post, or zip.

Mozilla/5.0 (compatible; kmky-not-a-bot/0.2; [kilomonkey.com...] )
robots.txt? NO

Even the bot's home page is rude ('notabot.txt' timed out): "This Server contains no useful information. The Domain name is not for sale. Goodbye."

Well, 403 back atcha, not-notabot.

(Hmm... That domain's listed to Kilo Monkey LLC, and an Aaron Flin -- of Dogpile metasearch fame?)

12:05 am on Feb 8, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Sept 17, 2002
posts:2251
votes: 0


My simple solution was to ban all Amazon IP Addresses. There's a thread here somewhere with all their net ranges that someone was kind enough to post for me. Problem solved! :)

2:02 pm on Feb 9, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 5, 2005
posts: 2038
votes: 1


That'd work, unless you're an Amazon associate/affiliate. Then banning all of their IPs can be problematic. From:

Amazon.com Associates Central - Help
No relevant products are showing on my page. Why is this the case?

"You must allow our spider (Mozilla/5.0 (compatible; >> AMZNKAssocBot/4.0)) to crawl your website. The crawl is needed in order to identify the content of your website and provide matching products. If you do not allow our spider to crawl your website(s) we will display selected products from our product lines in the Omakase Links."

FWIW, I've found this name works A-OK in rewrites:

Mozilla.*AMZNKAssocBot
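
As a quick sanity check of that pattern outside the server config, here is a small Python sketch. Python's `re` is used as a stand-in for the rewrite engine, which is fine for a pattern this simple but remains an assumption; the function name is hypothetical.

```python
import re

# Same pattern as quoted above, unanchored, as it would be used in a
# rewrite condition on the User-Agent header.
ASSOC_BOT = re.compile(r"Mozilla.*AMZNKAssocBot")


def is_assoc_bot(user_agent: str) -> bool:
    """True when the UA string contains 'Mozilla' followed later by 'AMZNKAssocBot'."""
    return ASSOC_BOT.search(user_agent or "") is not None
```

A UA string is trivial to fake, so an exception like this is probably best paired with a check of where the request actually came from.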

Oh, and coming up, Yet Another AWS Bot. (If nothing else, this thread provides a comprehensive list of their automatons, eh? :)

2:04 pm on Feb 9, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 5, 2005
posts: 2038
votes: 1


ec2-[yada-yada].compute-1.amazonaws.com
AISearchBot (Email: aisearchbot@gmail.com; If your web site doesn't want to be crawled, please send us a email.)

robots.txt? NO

2:45 pm on Feb 9, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Dec 27, 2004
posts:1883
votes: 55


-- If your web site doesn't want to be crawled, please send us a email --

Hmmm,

Let me ask it, hold on, brb.... cool I am back! I asked and it is not saying anything, I even tried a "pretty please" thingy, still not talking to me, and I am the owner. I think I'll have to learn that new Telepathic computer language that enables humans to talk to text files, the one that everybody is talking about you know...

It's nice and shiny though ;)

7:25 pm on Feb 11, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 5, 2005
posts:2038
votes: 1


ec2-[yada-yada].compute-1.amazonaws.com
Nokia6680/1.0 (4.04.07) SymbianOS/8.0 Series60/2.6 Profile/MIDP-2.0 Configuration/CLDC-1.1 (botmobi find.mobi/bot.html find@mtld.mobi)

robots.txt? YES

3:25 pm on Feb 12, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 5, 2005
posts: 2038
votes: 1


ec2-[yada-yada].compute-1.amazonaws.com
Mozilla/5.0 (iPhone; U; CPU like Mac OS X; en) AppleWebKit/420+ (KHTML, like Gecko) Version/3.0 Mobile/1A543a Safari/419.3 (botmobi find.mobi/bot.html find@mtld.mobi)/

robots.txt? YES

4:37 pm on Feb 12, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 5, 2005
posts: 2038
votes: 1


Incredibly, still more newcomers courtesy of .compute-1.amazonaws.com:

UA: -
robots.txt? NO

Mozilla/5.0 (compatible; NetcraftSurveyAgent/1.0; +info@netcraft.com)
robots.txt? NO

Python-urllib/2.4
robots.txt? NO

ia_archiver (+http://www.alexa.com/site/help/webmasters; crawler@alexa.com)
robots.txt? YES

P.S./FWIW:

In addition to owning AmazonAWS.com, IMDb.com, A9.com, and who knows what else, Amazon.com owns Alexa.com a.k.a. ia_archiver and ia_archiver-web, whose crawls feed Archive.org (WayBackMachine; Internet Archive). Archive.org itself is a separate non-profit.

9:54 pm on Feb 15, 2009 (gmt 0)

Junior Member

10+ Year Member

joined:June 25, 2005
posts:180
votes: 1


Nokia6680/1.0 (4.04.07) SymbianOS/8.0 Series60/2.6 Profile/MIDP-2.0 Configuration/CLDC-1.1 (botmobi find.mobi/bot.html find@mtld.mobi)

robots.txt? YES

You may write whatever you want into your robots.txt.

User-agent: *
Disallow: /

doesn't prevent requesting the root page.

11:49 pm on Feb 15, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 5, 2005
posts: 2038
votes: 1


I'm not sure I understand your point, sorry.

Technically, there's nothing in robots.txt that prevents any bot from doing whatever the heck its runners program it to do.

But a blanket "Disallow: /" means Do Not Crawl Here. Go Away. Now. And that Disallow includes the root page because it's in the /rootdir. Even if the root page retrieval is basically simultaneous with that of robots.txt (as is often the case), there still should be no caching or referencing of the root page's data.
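
To see how a well-behaved client should read that blanket rule, here is a short sketch using Python's standard `urllib.robotparser`. The helper name and the example.com URLs are placeholders, and this shows only what a compliant parser concludes, not what any particular bot does.

```python
from urllib.robotparser import RobotFileParser

BLANKET = "User-agent: *\nDisallow: /\n"


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Ask the standard-library parser whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Under the blanket rule the root page is off limits like everything else.
root_ok = allowed(BLANKET, "AnyBot", "http://example.com/")
deep_ok = allowed(BLANKET, "AnyBot", "http://example.com/some/page.html")
```

Both `root_ok` and `deep_ok` come out `False` here, which matches the reading that "Disallow: /" covers the root page too.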

Yeah. And if wishes were horses... :)

12:32 pm on Feb 16, 2009 (gmt 0)

Junior Member

10+ Year Member

joined:June 25, 2005
posts:180
votes: 1


Sorry. What I tried to write:
Nokia6680/1.0 (4.04.07) SymbianOS/8.0 Series60/2.6 Profile/MIDP-2.0 Configuration/CLDC-1.1 (botmobi find.mobi/bot.html find@mtld.mobi)
reads robots.txt, but doesn't care about the contents of robots.txt.

"To my recollection, most that asked for robots.txt honored it."
Really bad bots request robots.txt in order to get into the dark web and to confuse webmasters ("robots.txt? YES").

"Technically, there's nothing in robots.txt that prevents any bot from doing whatever the heck its runners program it to do."
Requesting my robots.txt leads to a site-wide ban.

3:16 pm on Feb 16, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 5, 2005
posts: 2038
votes: 1


Thanks for clarifying! A follow-up re:

"Requesting my robots.txt leads to a site-wide ban."

I'm curious as to how you do that, and also why? There are some bots that actually honor it:)

That said, before I figured out robots.cgi, I was really reluctant to let any bot read a directory-detailed robots.txt, even a baited one. Now, the riffraff only see Disallow: /
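
The actual robots.cgi isn't shown in this thread, so the following Python sketch only illustrates the general idea of serving different robots.txt bodies by user-agent. The trusted tokens (a guess at the "Big 3"), the directory names, and the function name are all assumptions.

```python
# Assumption: "Big 3" taken to mean the Google, Yahoo and MSN crawlers.
TRUSTED_TOKENS = ("googlebot", "slurp", "msnbot")

# Hypothetical directory names, for illustration only.
FULL_RULES = "User-agent: *\nDisallow: /private/\nDisallow: /cgi-bin/\n"

# What everyone else sees: no directory details at all.
LOCKED_OUT = "User-agent: *\nDisallow: /\n"


def robots_body(user_agent: str) -> str:
    """Return the detailed rules for trusted crawlers, a blanket Disallow otherwise."""
    ua = (user_agent or "").lower()
    if any(token in ua for token in TRUSTED_TOKENS):
        return FULL_RULES
    return LOCKED_OUT
```

Matching on the UA string alone is easy to spoof, so a production version would likely also verify the requesting host before handing out the detailed rules.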

FWIW, I allow all to access robots.txt but other than the Big 3, I block access to everything other than robots.txt, just in case. Additionally, the Big 3 are kept out of numerous directories via subdir .htaccess --

Whoa.

As I write this, I'm again whelmed by what a time-and-effort drain it is keeping the ramparts intact. An obsessively intriguing drain, but a drain nonetheless. And all of it unseen by site owners!

Anyway, getting back to AWS's bevy o' bots, Yet Another:

3:16 pm on Feb 16, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 5, 2005
posts:2038
votes: 1


ec2-[yada-yada].compute-1.amazonaws.com
SimilarPages/Nutch-1.0-dev (SimilarPages Nutch Crawler; [similarpages.com;...] info at similarpages dot com)

robots.txt? YES

5:07 pm on Feb 17, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 5, 2005
posts: 2038
votes: 1


(So much for even bothering to use a bot-related UA...)

ec2-[yada-yada].compute-1.amazonaws.com
Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.8.1.18) Gecko/20081112 Fedora/2.0.0.18-1.fc8 Firefox/2.0.0.18

robots.txt? NO

4:35 pm on Feb 26, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 5, 2005
posts: 2038
votes: 1


(And the hits just keep on coming...)

ec2-[yada-yada].compute-1.amazonaws.com
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)

robots.txt? NO

7:24 am on Mar 2, 2009 (gmt 0)

Preferred Member

10+ Year Member

joined:Sept 21, 2005
posts:379
votes: 0


New one for me...

UA: goroam/1.0-SNAPSHOT (goraom geo crawler; [goroam.net...] info@goroam.net)

Requested robots.txt - Yes
Obeyed it - Yes (I white-list bots, so all but a tiny few are banned)

Came from: 67.202.0.n

1:50 pm on Mar 9, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 5, 2005
posts: 2038
votes: 1


ec2-[yada-yada].compute-1.amazonaws.com
Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)

robots.txt? NO

4:49 pm on Mar 9, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 5, 2005
posts: 2038
votes: 1


ec2-[yada-yada].compute-1.amazonaws.com
AideRSS 2.0 (postrank.com)

robots.txt? NO

5:06 pm on Mar 9, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Sept 17, 2002
posts:2251
votes: 0


Question: Are there any legitimate services using Amazon/EC2 that would prevent one from simply banning all Amazon/EC2 net ranges?

7:19 pm on Mar 9, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Aug 29, 2006
posts:1312
votes: 0


"Are there any legitimate services using Amazon/EC2"

Legitimate, probably.

Worthwhile for webmasters, almost certainly not.

I block the whole lot and lose no sleep.
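
For checking an address against a block list of net ranges, a minimal Python sketch with the standard `ipaddress` module could look like this. The ranges below are illustrative placeholders only, not a verified or current list of Amazon allocations, and the function name is hypothetical.

```python
import ipaddress

# Illustrative placeholders only; substitute the published net ranges.
BLOCKED_RANGES = [
    ipaddress.ip_network("67.202.0.0/18"),  # assumption, covers the 67.202.x.x hits seen in this thread
    ipaddress.ip_network("192.0.2.0/24"),   # documentation range, stands in for "more ranges here"
]


def is_blocked(ip: str) -> bool:
    """True if ip parses as an address and falls inside any blocked range."""
    try:
        address = ipaddress.ip_address(ip)
    except ValueError:
        return False  # not a valid address, nothing to match
    return any(address in network for network in BLOCKED_RANGES)
```

The same list can be exported to firewall or .htaccess rules, so the ranges only need to be maintained in one place.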

...

7:26 pm on Mar 9, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Sept 17, 2002
posts:2251
votes: 0


I do the same, Sam. At least I do on most of my sites. Cowbot slipped through on one of my domains last week and it's now operating from EC2. I just started a new thread about it.

9:21 pm on Mar 9, 2009 (gmt 0)

Junior Member

5+ Year Member

joined:May 11, 2008
posts:55
votes: 0


Sam, Gary,

Including AMAZON-01/03, AES and NET ranges?

Cheers,
Phred

9:32 pm on Mar 9, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Sept 17, 2002
posts:2251
votes: 0


Yep.

[edited by: GaryK at 9:32 pm (utc) on Mar. 9, 2009]

11:59 pm on Mar 9, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 5, 2005
posts: 2038
votes: 1


ec2 -- Elastic Compute Cloud -- is just another name for server farm. And its anonymous clients are just as intrusive as those cloaked behind privatedns and secureserver and others of their ilk. (imho)

12:31 am on Mar 10, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Mar 31, 2002
posts:25430
votes: 0


Pfui,

Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.8.1.18) Gecko/20081112 Fedora/2.0.0.18-1.fc8 Firefox/2.0.0.18

could be a page thumbnailer such as that used by Ask.com search for their "preview" function in search results.


GaryK,

Alexa/Internet Archiver uses Elastic Compute Cloud services.

So you may want to allow those sub-ranges of EC2.

Jim

This 278-message thread spans 10 pages.
 
