
amazonaws.com plays host to wide variety of bad bots

Most recently seen: Gnomit

   
3:04 am on Jan 18, 2009 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



ec2-67-202-57-30.compute-1.amazonaws.com
Mozilla/5.0 (compatible; X11; U; Linux i686 (x86_64); en-US; +http://gnomit.com/) Gecko/2008092416 Gnomit/1.0"

- robots.txt? NO
- Unbalanced quotation mark in UA (closing only, no opening)
- site in UA yields this oh-so-descriptive info:

<html>
<head>
</head>
<body>
</body>
</html>

----- ----- ----- ----- -----
FWIW, bona fide amazonaws.com hosts spewed at least 33 bots on two of my sites in recent months. (Does someone get paid per bot or something?) Some bots may be new to some of you, or newly renamed. Here are the actual UA strings, in no particular order:

NetSeer/Nutch-0.9 (NetSeer Crawler; [netseer.com;...] crawler@netseer.com)
robots.txt? YES

Mozilla/5.0 (Windows; U; Windows NT 5.1; ru; rv:1.8.0.6) Gecko/20060728 Firefox/1.5.0.6
[Note ru.]
robots.txt? NO

feedfinder/1.371 Python-urllib/1.16 +http://www.aaronsw.com/2002/feedfinder/
robots.txt? NO

Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9b4pre) Gecko/2008022910 Viewzi/0.1
robots.txt? NO

Twitturly / v0.5
robots.txt? NO

YebolBot (compatible; Mozilla/5.0; MSIE 7.0; Windows NT 6.0; rv:1.8.1.11; mailTo:thunder.chang@gmail.com)
robots.txt? NO

YebolBot (Email: yebolbot@gmail.com; If the web crawling affects your web service, or you don't like to be crawled by us, please email us. We'll stop crawling immediately.)
[Whattaya think robots.txt is for, huh?]
robots.txt? YES ... Four times in 45 minutes

Attributor/Dejan-1.0-dev (Test crawler; [attributor.com;...] info at attributor com)
robots.txt? NO

PRCrawler/Nutch-0.9 (data mining development project)
robots.txt? YES

EnaBot/1.2 (http://www.enaball.com/crawler.html)
robots.txt? YES

Nokia6680/1.0 ((4.04.07) SymbianOS/8.0 Series60/2.6 Profile/MIDP-2.0 Configuration/CLDC-1.1 (botmobi find.mobi/bot.html) )
[Note spaced-out closing parens]
robots.txt? YES

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; T312461) Java/1.5.0_09
robots.txt? NO

TheRarestParser/0.2a (http://therarestwords.com/)
robots.txt? NO

Mozilla/5.0 (compatible; D1GArabicEngine/1.0; crawlmaster@d1g.com)
robots.txt? NO

Clustera Crawler/Nutch-1.0-dev (Clustera Crawler; [crawler.clustera.com;...] cluster@clustera.com)
robots.txt? YES

Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.7) Gecko/20060909 Firefox/1.5.0.7
robots.txt? YES

yacybot (i386 Linux 2.6.16-xenU; java 1.6.0_02; America/en) [yacy.net...]
robots.txt? NO

Mozilla/5.0
robots.txt? NO

Spock Crawler (http://www.spock.com/crawler)
robots.txt? YES

TinEye
robots.txt? NO

Teemer (NetSeer, Inc. is a Los Angeles based Internet startup company.; [netseer.com...] crawler@netseer.com)
robots.txt? YES

nnn/ttt (n)
robots.txt? YES

AideRSS/1.0 (aiderss.com)
robots.txt? NO

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)
robots.txt? NO

----- ----- ----- ----- -----
These two UAs alternated multiple times one afternoon:

Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)
robots.txt? NO

WebClient
robots.txt? YES

----- ----- ----- ----- -----
And finally, way too many offerings from "Paul," who's apparently unable to make up his mind, UA name-wise:

Mozilla/5.0 (compatible; page-store) [email:paul at page-store.com
robots.txt? NO

Mozilla/5.0 (compatible; heritrix/1.12.1 +http://www.page-store.com)
robots.txt? YES

Mozilla/5.0 (compatible; heritrix/1.12.1 +http://www.page-store.com) [email:paul@page-store.com]
robots.txt? YES

Mozilla/5.0 (compatible; zermelo; +http://www.powerset.com) [email:paul@page-store.com,crawl@powerset.com]
robots.txt? NO

zermelo Mozilla/5.0 compatible; heritrix/1.12.1 (+http://www.powerset.com) [email:crawl@powerset.com,email:paul@page-store.com]
robots.txt? YES

zermelo Mozilla/5.0 compatible; heritrix/1.12.1 (+http://www.powerset.com) [email:crawl@powerset.com,email:paul@page-store.com]
robots.txt? YES

Mozilla/5.0 (compatible; zermelo; +http://www.powerset.com) [email:paul@page-store.com,crawl@powerset.com]
robots.txt? NO

-----
Slippery little suckers indeed. Thank goodness I block amazonaws.com no matter what.
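
For anyone wanting to do likewise, here's a minimal .htaccess sketch of one way to do it (Apache 2.2-era Allow/Deny syntax; not necessarily the setup described above). Using a domain name in "Deny from" makes Apache do a double-reverse DNS check on each request, so there's a small lookup cost:

# Refuse anything whose IP resolves back to *.amazonaws.com
# (double-reverse DNS lookup per request -- weigh the cost)
Order Allow,Deny
Allow from all
Deny from amazonaws.com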

11:55 pm on Jan 20, 2009 (gmt 0)

WebmasterWorld Administrator incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



At this point amazonaws.com is synonymous with bad activity, so blocking them becomes the de facto standard.

The only concern is what happens when the "next big thing" launches on their cloud-computing facility, and those of us blocking it because of all the bad-bot noise miss out on being in the Google killer until it's too late.

1:16 am on Jan 21, 2009 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Pfui,

Very nice list!

But for clarity, what is the meaning of the "Robots.txt? Yes/No" entries above?

The user-agent fetches robots.txt
-or-
The user-agent fetches and obeys robots.txt

Thanks,
Jim

9:19 pm on Jan 29, 2009 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



Sorry for the belated reply. (Clicking 'Email notification of new replies' hasn't worked for me for a long time.)

-----
Jim:

"robots.txt? YES" <== Asked for it
"robots.txt? NO" <== Didn't ask

To my recollection, most that asked for robots.txt honored it. I didn't really pay attention because that's the only file I let unauthorized bots -- and amazonaws.com, the host -- access.
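
One way to get that "robots.txt only" behavior in .htaccess is a rewrite that 403s everything except robots.txt for whatever you already treat as an unauthorized bot. A minimal sketch -- the UA fragments are placeholders for your own blocklist, not anyone's actual rules:

RewriteEngine On
# Placeholder patterns -- substitute your own list of unwelcome agents
RewriteCond %{HTTP_USER_AGENT} (Nutch|YebolBot|heritrix) [NC]
# Leave robots.txt itself reachable so the bot at least has a chance to obey it
RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteRule . - [F]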

By the way, here's another one. If this were an eBay item, they'd get yanked for keyword-spamming the spider's brand!

Nutch/Nutch-1.0-dev (A Nutch-based crawler.; [lucene.apache.org...] nutch-agent AT lucene.apache.org)
robots.txt? YES

-----
Bill:

At this point, I don't know what I think about AWS, other than I'm amazed at (& irked by) all of its/their bots. (Plus their CloudFront info is so PC-centric that this Mac/Apache person's eyes just glaze over.) Why spider constantly, I wonder? Are they trying to rival Google? What are they doing with all of the data? Where are they going with all of their really geeky doodads?

Oh, to be a fly on the wall in Bezos's office, eh?

-----
All:

What are your thoughts on their Web Services? Are you intrigued? Have you tested anything and noticed more spidering? Their cookie systems are fantastically inbred -- notice anything new in that dept.?

What I do know is that between amazon.com and imdb.com and amazonfresh.com (fresh.amazon.com) and amazonaws.com (aws.amazon.com), they're all over the map. Literally. And their spiders are squished all over my server. Literally.

: )

[edited by: Pfui at 9:21 pm (utc) on Jan. 29, 2009]

10:23 pm on Jan 29, 2009 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



it's EC2, so they're charged by traffic and computing time, right?
Let them pay. If you discover they're spidering your site and you can spare the traffic, just send big, complex files. They need some form of parser, and handling big documents will slow their crawlers down and cost 'em money. If they're using a regexp to retrieve URLs to crawl, it won't be that funny, but if they're using a stack-based parser and you send them a document with thousands of deeply nested nodes, it might get funny ;)

Generally, I'm using Amazon Web Services (the Alexa information thingy) and the e-commerce web service, though neither all that heavily (my bill averages $3/month). I usually don't mind bot traffic (traffic is free and my servers can easily handle the load right now) if the bot behaves. If it doesn't, I don't care whether it's an official Amazon bot or just one hosted on their platform; it gets blocked, most likely via iptables, so I don't have all those 403 errors in my logs.

8:22 pm on Feb 6, 2009 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



Today's new ec2-[yada-yada].compute-1.amazonaws.com bot, and/or its botrunner, was more rude than most. No robots.txt, then hit 11 times in 5 visits over 4 hours (thus far). Perhaps not so coincidentally, Every. Single. Hit. was to an ALL-robot restricted page, post, or zip.

Mozilla/5.0 (compatible; kmky-not-a-bot/0.2; [kilomonkey.com...] )
robots.txt? NO

Even the bot's home page is rude ('notabot.txt' timed out): "This Server contains no useful information. The Domain name is not for sale. Goodbye."

Well, 403 back atcha, not-notabot.

(Hmm... That domain's listed to Kilo Monkey LLC, and an Aaron Flin -- of Dogpile metasearch fame?)

12:05 am on Feb 8, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



My simple solution was to ban all Amazon IP Addresses. There's a thread here somewhere with all their net ranges that someone was kind enough to post for me. Problem solved! :)
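
In .htaccess terms that boils down to a stack of Deny lines. The ranges below are only illustrative examples of EC2 blocks in use around this time -- check the thread mentioned above (or ARIN) for the current, complete list before relying on them:

Order Allow,Deny
Allow from all
# Example Amazon/EC2 ranges only -- verify against current allocations
Deny from 67.202.0.0/18
Deny from 72.44.32.0/19
Deny from 75.101.128.0/17
Deny from 174.129.0.0/16
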
2:02 pm on Feb 9, 2009 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



That'd work, unless you're an Amazon associate/affiliate. Then banning all of their IPs can be problematic. From:

Amazon.com Associates Central - Help
No relevant products are showing on my page. Why is this the case?

"You must allow our spider (Mozilla/5.0 (compatible; >> AMZNKAssocBot/4.0)) to crawl your website. The crawl is needed in order to identify the content of your website and provide matching products. If you do not allow our spider to crawl your website(s) we will display selected products from our product lines in the Omakase Links."

FWIW, I've found this name works A-OK in rewrites:

Mozilla.*AMZNKAssocBot
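
If you block the EC2 ranges wholesale but still want the associate spider through, one way to carve out the exception with mod_rewrite is sketched below. Note that %{REMOTE_HOST} is only populated when hostname lookups are enabled; otherwise test %{REMOTE_ADDR} against the netblocks instead:

RewriteEngine On
# Let the associate-matching spider pass (the pattern quoted above)
RewriteCond %{HTTP_USER_AGENT} !(Mozilla.*AMZNKAssocBot)
# ...and refuse everything else arriving from an amazonaws.com host
RewriteCond %{REMOTE_HOST} \.amazonaws\.com$ [NC]
RewriteRule . - [F]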

Oh, and coming up, Yet Another AWS Bot. (If nothing else, this thread provides a comprehensive list of their automatons, eh? :)

2:04 pm on Feb 9, 2009 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



ec2-[yada-yada].compute-1.amazonaws.com
AISearchBot (Email: aisearchbot@gmail.com; If your web site doesn't want to be crawled, please send us a email.)

robots.txt? NO

2:45 pm on Feb 9, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



-- If your web site doesn't want to be crawled, please send us a email --

Hmmm,

Let me ask it, hold on, brb... Cool, I'm back! I asked, and it isn't saying anything. I even tried a "pretty please" thingy; still not talking to me, and I'm the owner. I think I'll have to learn that new telepathic computer language that lets humans talk to text files, the one everybody's talking about, you know...

It's nice and shiny though ;)

7:25 pm on Feb 11, 2009 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



ec2-[yada-yada].compute-1.amazonaws.com
Nokia6680/1.0 (4.04.07) SymbianOS/8.0 Series60/2.6 Profile/MIDP-2.0 Configuration/CLDC-1.1 (botmobi find.mobi/bot.html find@mtld.mobi)

robots.txt? YES

3:25 pm on Feb 12, 2009 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



ec2-[yada-yada].compute-1.amazonaws.com
Mozilla/5.0 (iPhone; U; CPU like Mac OS X; en) AppleWebKit/420+ (KHTML, like Gecko) Version/3.0 Mobile/1A543a Safari/419.3 (botmobi find.mobi/bot.html find@mtld.mobi)/

robots.txt? YES

4:37 pm on Feb 12, 2009 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



Incredibly, still more newcomers courtesy of .compute-1.amazonaws.com:

UA: -
robots.txt? NO

Mozilla/5.0 (compatible; NetcraftSurveyAgent/1.0; +info@netcraft.com)
robots.txt? NO

Python-urllib/2.4
robots.txt? NO

ia_archiver (+http://www.alexa.com/site/help/webmasters; crawler@alexa.com)
robots.txt? YES

P.S./FWIW:

In addition to owning AmazonAWS.com, IMDb.com, A9.com, and who knows what else, Amazon.com owns Alexa.com, whose ia_archiver and ia_archiver-web crawlers feed Archive.org (the Internet Archive / WayBackMachine).

9:54 pm on Feb 15, 2009 (gmt 0)

5+ Year Member



Nokia6680/1.0 (4.04.07) SymbianOS/8.0 Series60/2.6 Profile/MIDP-2.0 Configuration/CLDC-1.1 (botmobi find.mobi/bot.html find@mtld.mobi)

robots.txt? YES

You may write whatever you want into your robots.txt. Even

User-agent: *
Disallow: /

doesn't prevent a bot from requesting the root page.

11:49 pm on Feb 15, 2009 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



I'm not sure I understand your point, sorry.

Technically, there's nothing in robots.txt that prevents any bot from doing whatever the heck its runners program it to do.

But a blanket "Disallow: /" means Do Not Crawl Here. Go Away. Now. And that Disallow includes the root page, because the root page sits under the / path. Even if the root page retrieval is practically simultaneous with that of robots.txt (as is often the case), there still should be no caching or referencing of the root page's data.

Yeah. And if wishes were horses... :)

12:32 pm on Feb 16, 2009 (gmt 0)

5+ Year Member



Sorry. What I tried to write:

Nokia6680/1.0 (4.04.07) SymbianOS/8.0 Series60/2.6 Profile/MIDP-2.0 Configuration/CLDC-1.1 (botmobi find.mobi/bot.html find@mtld.mobi)

reads robots.txt but doesn't care about its contents.

-- To my recollection, most that asked for robots.txt honored it. --
Really bad bots request robots.txt in order to get into the dark web and to confuse webmasters ("robots.txt? YES").

-- Technically, there's nothing in robots.txt that prevents any bot from doing whatever the heck its runners program it to do. --
Requesting my robots.txt leads to a site-wide ban.

3:16 pm on Feb 16, 2009 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



Thanks for clarifying! A follow-up re:

"Requesting my robots.txt leads to a site-wide ban."

I'm curious as to how you do that, and also why? There are some bots that actually honor it:)

That said, before I figured out robots.cgi, I was really reluctant to let any bot read a directory-detailed robots.txt, even a baited one. Now, the riffraff only see Disallow: /
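
For the curious, the "riffraff only see Disallow: /" trick can also be approximated without a CGI: rewrite robots.txt to an alternate static file for anyone not on the whitelist. The file name and the Big-3 patterns below are placeholders, not the actual setup described above:

RewriteEngine On
# Anyone who isn't a whitelisted major (placeholder patterns) gets the deny-all file
RewriteCond %{HTTP_USER_AGENT} !(Googlebot|Slurp|msnbot) [NC]
RewriteRule ^robots\.txt$ /robots-denyall.txt [L]
# where robots-denyall.txt contains just:
#   User-agent: *
#   Disallow: /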

FWIW, I let everyone fetch robots.txt, but apart from the Big 3, I block access to everything else, just in case. Additionally, the Big 3 are kept out of numerous directories via subdirectory .htaccess files --

Whoa.

As I write this, I'm again whelmed by what a time-and-effort drain it is keeping the ramparts intact. An obsessively intriguing drain, but a drain nonetheless. And all of it unseen by site owners!

Anyway, getting back to AWS's bevy o' bots, Yet Another:

3:16 pm on Feb 16, 2009 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



ec2-[yada-yada].compute-1.amazonaws.com
SimilarPages/Nutch-1.0-dev (SimilarPages Nutch Crawler; [similarpages.com;...] info at similarpages dot com)

robots.txt? YES

5:07 pm on Feb 17, 2009 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



(So much for even bothering to use a bot-related UA...)

ec2-[yada-yada].compute-1.amazonaws.com
Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.8.1.18) Gecko/20081112 Fedora/2.0.0.18-1.fc8 Firefox/2.0.0.18

robots.txt? NO

4:35 pm on Feb 26, 2009 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



(And the hits just keep on coming...)

ec2-[yada-yada].compute-1.amazonaws.com
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)

robots.txt? NO

7:24 am on Mar 2, 2009 (gmt 0)

5+ Year Member



New one for me...

UA: goroam/1.0-SNAPSHOT (goraom geo crawler; [goroam.net...] info@goroam.net)

Requested robots.txt - Yes
Obeyed it - Yes (I white-list bots, so all but a tiny few are banned)

Came from: 67.202.0.n

1:50 pm on Mar 9, 2009 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



ec2-[yada-yada].compute-1.amazonaws.com
Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)

robots.txt? NO

4:49 pm on Mar 9, 2009 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



ec2-[yada-yada].compute-1.amazonaws.com
AideRSS 2.0 (postrank.com)

robots.txt? NO

5:06 pm on Mar 9, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Question: Are there any legitimate services using Amazon/EC2 that would prevent one from simply banning all Amazon/EC2 net ranges?
7:19 pm on Mar 9, 2009 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



Are there any legitimate services using Amazon/EC2

Legitimate, probably.

Worthwhile for webmasters, almost certainly not.

I block the whole lot and lose no sleep.

...

7:26 pm on Mar 9, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I do the same, Sam. At least I do on most of my sites. Cowbot slipped through on one of my domains last week and it's now operating from EC2. I just started a new thread about it.
9:21 pm on Mar 9, 2009 (gmt 0)

5+ Year Member



Sam, Gary,

Including the AMAZON-01/03, AES, and NET ranges?

Cheers,
Phred

9:32 pm on Mar 9, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Yep.

[edited by: GaryK at 9:32 pm (utc) on Mar. 9, 2009]

11:59 pm on Mar 9, 2009 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



EC2 -- Elastic Compute Cloud -- is just another name for a server farm. And its anonymous clients are just as intrusive as those cloaked behind privatedns, secureserver, and others of their ilk. (imho)
12:31 am on Mar 10, 2009 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Pfui,

Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.8.1.18) Gecko/20081112 Fedora/2.0.0.18-1.fc8 Firefox/2.0.0.18

could be a page thumbnailer such as that used by Ask.com search for their "preview" function in search results.


GaryK,

Alexa/Internet Archive (ia_archiver) uses Elastic Compute Cloud services.

So you may want to allow those sub-ranges of EC2.

Jim
