
Search Engine Spider and User Agent Identification Forum

This 278-message thread spans 10 pages.
amazonaws.com plays host to a wide variety of bad bots
Most recently seen: Gnomit
Pfui




msg:3828720
 3:04 am on Jan 18, 2009 (gmt 0)

ec2-67-202-57-30.compute-1.amazonaws.com
Mozilla/5.0 (compatible; X11; U; Linux i686 (x86_64); en-US; +http://gnomit.com/) Gecko/2008092416 Gnomit/1.0"

- robots.txt? NO
- Unbalanced quotation mark in UA (closing only)
- site in UA yields this oh-so-descriptive info:

<html>
<head>
</head>
<body>
</body>
</html>
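
Malformed UA strings like the one above (a stray closing quote, an unclosed bracket, or no UA at all) can be flagged by simply counting delimiters. A minimal Python sketch; the function name `ua_anomalies` is made up for illustration, and this is only a heuristic for spotting sloppy strings, not a bot detector:

```python
def ua_anomalies(ua: str) -> list[str]:
    """Return a list of oddities found in a User-Agent string (heuristic only)."""
    problems = []
    stripped = ua.strip()
    # Many servers log a missing User-Agent header as "-".
    if stripped in ("", "-"):
        problems.append("empty user agent")
        return problems
    # An odd number of double quotes means one was opened or closed, but not both.
    if stripped.count('"') % 2 == 1:
        problems.append("unbalanced double quote")
    # Compare opening and closing counts for each bracket pair.
    for opener, closer in (("(", ")"), ("[", "]")):
        if stripped.count(opener) != stripped.count(closer):
            problems.append(f"unbalanced {opener}{closer}")
    return problems
```

Counting is deliberately crude: it does not check nesting order, so a string like `)(` would pass.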

----- ----- ----- ----- -----
FWIW, bona fide amazonaws.com hosts spewed at least 33 bots on two of my sites in recent months. (Does someone get paid per bot or something?) Some bots may be new to some of you; or newly renamed. Here are the actual UA strings; in no particular order:

NetSeer/Nutch-0.9 (NetSeer Crawler; [netseer.com;...] crawler@netseer.com)
robots.txt? YES

Mozilla/5.0 (Windows; U; Windows NT 5.1; ru; rv:1.8.0.6) Gecko/20060728 Firefox/1.5.0.6
[Note ru.]
robots.txt? NO

feedfinder/1.371 Python-urllib/1.16 +http://www.aaronsw.com/2002/feedfinder/
robots.txt? NO

Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9b4pre) Gecko/2008022910 Viewzi/0.1
robots.txt? NO

Twitturly / v0.5
robots.txt? NO

YebolBot (compatible; Mozilla/5.0; MSIE 7.0; Windows NT 6.0; rv:1.8.1.11; mailTo:thunder.chang@gmail.com)
robots.txt? NO

YebolBot (Email: yebolbot@gmail.com; If the web crawling affects your web service, or you don't like to be crawled by us, please email us. We'll stop crawling immediately.)
[Whattaya think robots.txt is for, huh?]
robots.txt? YES ... Four times in 45 minutes

Attributor/Dejan-1.0-dev (Test crawler; [attributor.com;...] info at attributor com)
robots.txt? NO

PRCrawler/Nutch-0.9 (data mining development project)
robots.txt? YES

EnaBot/1.2 (http://www.enaball.com/crawler.html)
robots.txt? YES

Nokia6680/1.0 ((4.04.07) SymbianOS/8.0 Series60/2.6 Profile/MIDP-2.0 Configuration/CLDC-1.1 (botmobi find.mobi/bot.html) )
[Note spaced-out closing parens]
robots.txt? YES

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; T312461) Java/1.5.0_09
robots.txt? NO

TheRarestParser/0.2a (http://therarestwords.com/)
robots.txt? NO

Mozilla/5.0 (compatible; D1GArabicEngine/1.0; crawlmaster@d1g.com)
robots.txt? NO

Clustera Crawler/Nutch-1.0-dev (Clustera Crawler; [crawler.clustera.com;...] cluster@clustera.com)
robots.txt? YES

Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.7) Gecko/20060909 Firefox/1.5.0.7
robots.txt? YES

yacybot (i386 Linux 2.6.16-xenU; java 1.6.0_02; America/en) [yacy.net...]
robots.txt? NO

Mozilla/5.0
robots.txt? NO

Spock Crawler (http://www.spock.com/crawler)
robots.txt? YES

TinEye
robots.txt? NO

Teemer (NetSeer, Inc. is a Los Angeles based Internet startup company.; [netseer.com...] crawler@netseer.com)
robots.txt? YES

nnn/ttt (n)
robots.txt? YES

AideRSS/1.0 (aiderss.com)
robots.txt? NO

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)
robots.txt? NO

----- ----- ----- ----- -----
These two UAs alternated multiple times one afternoon:

Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)
robots.txt? NO

WebClient
robots.txt? YES

----- ----- ----- ----- -----
And finally, way too many offerings from "Paul," who's apparently unable to make up his mind, UA name-wise:

Mozilla/5.0 (compatible; page-store) [email:paul at page-store.com
robots.txt? NO

Mozilla/5.0 (compatible; heritrix/1.12.1 +http://www.page-store.com)
robots.txt? YES

Mozilla/5.0 (compatible; heritrix/1.12.1 +http://www.page-store.com) [email:paul@page-store.com]
robots.txt? YES

Mozilla/5.0 (compatible; zermelo; +http://www.powerset.com) [email:paul@page-store.com,crawl@powerset.com]
robots.txt? NO

zermelo Mozilla/5.0 compatible; heritrix/1.12.1 (+http://www.powerset.com) [email:crawl@powerset.com,email:paul@page-store.com]
robots.txt? YES

zermelo Mozilla/5.0 compatible; heritrix/1.12.1 (+http://www.powerset.com) [email:crawl@powerset.com,email:paul@page-store.com]
robots.txt? YES

Mozilla/5.0 (compatible; zermelo; +http://www.powerset.com) [email:paul@page-store.com,crawl@powerset.com]
robots.txt? NO

-----
Slippery little suckers indeed. Thank goodness I block amazonaws.com no matter what.
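
For anyone who wants to recognize these hosts in code, here is a minimal Python sketch that matches the hostname shape quoted at the top of this post (ec2-67-202-57-30.compute-1.amazonaws.com) and recovers the IP embedded in it. The pattern and the function name `ec2_ip_from_hostname` are assumptions based only on the hostnames seen in this thread; other Amazon hostname formats may exist and would not match.

```python
import ipaddress
import re

# Shape taken from hostnames quoted in this thread, e.g.
# ec2-67-202-57-30.compute-1.amazonaws.com
EC2_HOST = re.compile(
    r"^ec2-(\d{1,3})-(\d{1,3})-(\d{1,3})-(\d{1,3})"
    r"\.(?:[a-z0-9-]+\.)*amazonaws\.com\.?$",
    re.IGNORECASE,
)

def ec2_ip_from_hostname(hostname: str):
    """Return the IPv4 address embedded in an ec2-a-b-c-d hostname, or None."""
    match = EC2_HOST.match(hostname.strip())
    if not match:
        return None
    try:
        # Rejects impossible octets such as 999.
        return ipaddress.IPv4Address(".".join(match.groups()))
    except ipaddress.AddressValueError:
        return None
```

A reverse-DNS name is supplied by whoever controls the PTR record, so it is worth confirming that the hostname resolves back to the connecting IP before trusting it.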

 

incrediBILL




msg:3830853
 11:55 pm on Jan 20, 2009 (gmt 0)

At this point amazonaws.com is synonymous with bad activity so blocking them becomes the de facto standard.

The only concern is what happens when the "next big thing" uses their cloud computing facility and those of us blocking it because of all the bad bot noise totally miss out on being in the Google killer until it's too late?

jdMorgan




msg:3830899
 1:16 am on Jan 21, 2009 (gmt 0)

Pfui,

Very nice list!

But for clarity, what is the meaning of the "Robots.txt? Yes/No" entries above?

The user-agent fetches robots.txt
-or-
The user-agent fetches and obeys robots.txt

Thanks,
Jim

Pfui




msg:3837755
 9:19 pm on Jan 29, 2009 (gmt 0)

Sorry for the belated reply. (Clicking 'Email notification of new replies' hasn't worked for me for a long time.)

-----
Jim:

"robots.txt? YES" <== Asked for it
"robots.txt? NO" <== Didn't ask

To my recollection, most that asked for robots.txt honored it. I didn't really pay attention because that's the only file I let unauthorized bots -- and amazonaws.com, the host -- access.
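
The YES/NO column above can be reproduced from an access log. A minimal Python sketch, assuming Apache's combined log format with hostname lookups turned on; the regex and the function name `robots_txt_report` are illustrative, not the script actually used for the list:

```python
import re
from collections import defaultdict

# Assumes Apache "combined" format:
# host ident user [time] "METHOD path proto" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[[^\]]*\] "(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<ua>.*)"$'
)

def robots_txt_report(lines):
    """Map (host, user agent) to "YES" if it ever asked for /robots.txt, else "NO"."""
    asked = defaultdict(bool)
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if not match:
            continue  # skip lines that are not in the assumed format
        key = (match["host"], match["ua"])
        path = match["path"].split("?", 1)[0]  # ignore any query string
        asked[key] = asked[key] or path == "/robots.txt"
    return {key: ("YES" if flag else "NO") for key, flag in asked.items()}
```

As in the list, this only records whether the file was requested. Whether the bot then obeyed it has to be judged from its later requests.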

By the way, here's another one. If this were an eBay item, they'd get yanked for keyword-spamming the spider's brand!

Nutch/Nutch-1.0-dev (A Nutch-based crawler.; [lucene.apache.org...] nutch-agent AT lucene.apache.org)
robots.txt? YES

-----
Bill:

At this point, I don't know what I think about AWS, other than I'm amazed at (& irked by) all of its/their bots. (Plus their CloudFront info is so PC-centric that this Mac/Apache person's eyes just glaze over.) Why spider constantly, I wonder? Are they trying to rival Google? What are they doing with all of the data? Where are they going with all of their really geeky doodads?

Oh, to be a fly on the wall in Bezos's office, eh?

-----
All:

What are your thoughts on their Web Services? Are you intrigued? Have you tested anything and noticed more spidering? Their cookie systems are fantastically inbred -- notice anything new in that dept.?

What I do know is that between amazon.com and imdb.com and amazonfresh.com (fresh.amazon.com) and amazonaws.com (aws.amazon.com), they're all over the map. Literally. And their spiders are squished all over my server. Literally.

: )

[edited by: Pfui at 9:21 pm (utc) on Jan. 29, 2009]

janharders




msg:3837801
 10:23 pm on Jan 29, 2009 (gmt 0)

it's ec2, so they're charged by traffic and computing time, right?
let them pay. if you discover they're spidering your site and you can spare the traffic, just send big, complex files. they need some form of parser, and handling big documents will slow their crawlers down and cost 'em money. if they're using a regexp to retrieve urls to crawl, it won't be that funny, but if they're using a stack-based parser and you send them a document with thousands of nested nodes, it might get funny ;)
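
a rough python sketch of what such a document could look like. the function name `nested_document`, the tag and the depth are arbitrary choices for illustration, and whether it actually slows any given crawler depends entirely on how that crawler parses:

```python
def nested_document(depth: int, tag: str = "div") -> str:
    """Build a small HTML page whose body is `depth` levels of nested elements."""
    opening = f"<{tag}>" * depth   # <div><div><div>...
    closing = f"</{tag}>" * depth  # ...</div></div></div>
    return f"<html><body>{opening}x{closing}</body></html>"
```

at a depth of a few thousand the page is still only tens of kilobytes, so it costs the sender very little traffic.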

generally, I'm using amazon-webservices (the alexa information thingy) and the ecommerce-webservice, though neither of them that heavily (my bill averages $3 / month). I usually don't mind bot traffic (traffic is free and my servers can easily handle the load right now) if the bot behaves. If it doesn't, I don't care if it's an official bot by amazon or just one that's hosted on their platform, it gets blocked, most likely in iptables, so I don't have all those 403 errors in my logs.

Pfui




msg:3844071
 8:22 pm on Feb 6, 2009 (gmt 0)

Today's new ec2-[yada-yada].compute-1.amazonaws.com bot, and/or its botrunner, was more rude than most. No robots.txt, then hit 11 times in 5 visits over 4 hours (thus far). Perhaps not so coincidentally, Every. Single. Hit. was to an ALL-robot restricted page, post, or zip.

Mozilla/5.0 (compatible; kmky-not-a-bot/0.2; [kilomonkey.com...] )
robots.txt? NO

Even the bot's home page is rude ('notabot.txt' timed out): "This Server contains no useful information. The Domain name is not for sale. Goodbye."

Well, 403 back atcha, not-notabot.

(Hmm... That domain's listed to Kilo Monkey LLC, and an Aaron Flin -- of Dogpile metasearch fame?)

GaryK




msg:3844763
 12:05 am on Feb 8, 2009 (gmt 0)

My simple solution was to ban all Amazon IP Addresses. There's a thread here somewhere with all their net ranges that someone was kind enough to post for me. Problem solved! :)
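
A range ban like this boils down to a membership test against a list of networks. A minimal Python sketch; the networks below are documentation placeholders (RFC 5737), NOT Amazon's actual ranges, and the names `BLOCKED_NETWORKS` and `is_blocked` are made up for illustration:

```python
import ipaddress

# Placeholder networks (RFC 5737 documentation ranges), NOT real Amazon ranges.
# Substitute the net ranges you actually want to ban.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]

def is_blocked(ip: str, networks=BLOCKED_NETWORKS) -> bool:
    """Return True if `ip` falls inside any of the listed networks."""
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        return False  # not a valid IP literal, so nothing to match
    return any(addr in net for net in networks)
```

In practice the same list would more likely live in the firewall or server config, but the logic is the same.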

Pfui




msg:3845591
 2:02 pm on Feb 9, 2009 (gmt 0)

That'd work, unless you're an Amazon associate/affiliate. Then banning all of their IPs can be problematic. From:

Amazon.com Associates Central - Help
No relevant products are showing on my page. Why is this the case?

"You must allow our spider (Mozilla/5.0 (compatible; >> AMZNKAssocBot/4.0)) to crawl your website. The crawl is needed in order to identify the content of your website and provide matching products. If you do not allow our spider to crawl your website(s) we will display selected products from our product lines in the Omakase Links."

FWIW, I've found this name works A-OK in rewrites:

Mozilla.*AMZNKAssocBot
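
The same pattern can be tried outside of a rewrite rule. A minimal Python sketch; the function name `is_associates_bot` is made up for illustration, and since a UA string is trivially spoofed, a match says nothing about where the request really came from:

```python
import re

# Same pattern as the rewrite condition above.
ASSOC_BOT = re.compile(r"Mozilla.*AMZNKAssocBot")

def is_associates_bot(ua: str) -> bool:
    """Return True if the User-Agent string matches the Associates spider pattern."""
    return ASSOC_BOT.search(ua) is not None
```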

Oh, and coming up, Yet Another AWS Bot. (If nothing else, this thread provides a comprehensive list of their automatons, eh? :)

Pfui




msg:3845593
 2:04 pm on Feb 9, 2009 (gmt 0)

ec2-[yada-yada].compute-1.amazonaws.com
AISearchBot (Email: aisearchbot@gmail.com; If your web site doesn't want to be crawled, please send us a email.)

robots.txt? NO

blend27




msg:3845603
 2:45 pm on Feb 9, 2009 (gmt 0)

-- If your web site doesn't want to be crawled, please send us a email --

Hmmm,

Let me ask it, hold on, brb.... cool I am back! I asked and it is not saying anything, I even tried a "pretty please" thingy, still not talking to me, and I am the owner. I think I'll have to learn that new Telepathic computer language that enables humans to talk to text files, the one that everybody is talking about you know...

It's nice and shiny though ;)

Pfui




msg:3847432
 7:25 pm on Feb 11, 2009 (gmt 0)

ec2-[yada-yada].compute-1.amazonaws.com
Nokia6680/1.0 (4.04.07) SymbianOS/8.0 Series60/2.6 Profile/MIDP-2.0 Configuration/CLDC-1.1 (botmobi find.mobi/bot.html find@mtld.mobi)

robots.txt? YES

Pfui




msg:3848007
 3:25 pm on Feb 12, 2009 (gmt 0)

ec2-[yada-yada].compute-1.amazonaws.com
Mozilla/5.0 (iPhone; U; CPU like Mac OS X; en) AppleWebKit/420+ (KHTML, like Gecko) Version/3.0 Mobile/1A543a Safari/419.3 (botmobi find.mobi/bot.html find@mtld.mobi)/

robots.txt? YES

Pfui




msg:3848081
 4:37 pm on Feb 12, 2009 (gmt 0)

Incredibly, still more newcomers courtesy of .compute-1.amazonaws.com:

UA: -
robots.txt? NO

Mozilla/5.0 (compatible; NetcraftSurveyAgent/1.0; +info@netcraft.com)
robots.txt? NO

Python-urllib/2.4
robots.txt? NO

ia_archiver (+http://www.alexa.com/site/help/webmasters; crawler@alexa.com)
robots.txt? YES

P.S./FWIW:

In addition to owning AmazonAWS.com, IMDb.com, A9.com, and who knows what else, Amazon.com owns Alexa.com, whose crawler is ia_archiver (and ia_archiver-web). Archive.org (Wayback Machine; Internet Archive) is a separate non-profit that has long received crawl data from Alexa.

thetrasher




msg:3850196
 9:54 pm on Feb 15, 2009 (gmt 0)

Nokia6680/1.0 (4.04.07) SymbianOS/8.0 Series60/2.6 Profile/MIDP-2.0 Configuration/CLDC-1.1 (botmobi find.mobi/bot.html find@mtld.mobi)

robots.txt? YES

You may write whatever you want into your robots.txt.
User-agent: *
Disallow: /

doesn't prevent requesting the root page.

Pfui




msg:3850263
 11:49 pm on Feb 15, 2009 (gmt 0)

I'm not sure I understand your point, sorry.

Technically, there's nothing in robots.txt that prevents any bot from doing whatever the heck its runners program it to do.

But a blanket "Disallow: /" means Do Not Crawl Here. Go Away. Now. And that Disallow includes the root page because it's in the /rootdir. Even if the root page retrieval is basically simultaneous with that of robots.txt (as is often the case), there still should be no caching or referencing of the root page's data.

Yeah. And if wishes were horses... :)
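
For what it's worth, a compliant parser reads the blanket Disallow the same way. A minimal sketch using Python's standard `urllib.robotparser`; the helper name `allowed` is made up for illustration, and "botmobi" is just the token from the UA quoted above:

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_txt: str, user_agent: str, path: str) -> bool:
    """Ask the standard-library parser whether `user_agent` may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)

# The two-line robots.txt quoted earlier in the thread.
BLANKET = "User-agent: *\nDisallow: /\n"

print(allowed(BLANKET, "botmobi", "/"))            # root page
print(allowed(BLANKET, "botmobi", "/index.html"))  # any other page
```

Both calls report that fetching is not allowed, root page included. That only describes what a well-behaved bot should conclude; nothing in the file enforces it.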

thetrasher




msg:3850569
 12:32 pm on Feb 16, 2009 (gmt 0)

Sorry. What I tried to write:
Nokia6680/1.0 (4.04.07) SymbianOS/8.0 Series60/2.6 Profile/MIDP-2.0 Configuration/CLDC-1.1 (botmobi find.mobi/bot.html find@mtld.mobi)
reads robots.txt, but doesn't care about the contents of robots.txt.

"To my recollection, most that asked for robots.txt honored it."
Really bad bots request robots.txt in order to find the hidden (disallowed) parts of a site and to confuse webmasters ("robots.txt? YES").

"Technically, there's nothing in robots.txt that prevents any bot from doing whatever the heck its runners program it to do."
Requesting my robots.txt leads to a site-wide ban.

Pfui




msg:3850661
 3:16 pm on Feb 16, 2009 (gmt 0)

Thanks for clarifying! A follow-up re:

"Requesting my robots.txt leads to a site-wide ban."

I'm curious as to how you do that, and also why? There are some bots that actually honor it:)

That said, before I figured out robots.cgi, I was really reluctant to let any bot read a directory-detailed robots.txt, even a baited one. Now, the riffraff only see Disallow: /
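
The idea of serving a different robots.txt per visitor can be sketched in a few lines of Python. This is not the robots.cgi mentioned above, just an illustration of the approach; the agent tokens, the rule text and the function name `robots_txt_for` are all hypothetical:

```python
# Hypothetical allow-list of UA substrings; replace with the bots you trust.
ALLOWED_AGENTS = ("examplebot-a", "examplebot-b")

# Hypothetical detailed rules, shown only to allow-listed bots.
FULL_RULES = "User-agent: *\nDisallow: /private/\n"

# What everyone else gets.
DENY_ALL = "User-agent: *\nDisallow: /\n"

def robots_txt_for(user_agent: str) -> str:
    """Return the robots.txt body to serve for a given User-Agent string."""
    ua = user_agent.lower()
    if any(token in ua for token in ALLOWED_AGENTS):
        return FULL_RULES
    return DENY_ALL
```

Matching on the UA alone is easy to fool, so a real setup would probably also check the requesting IP or hostname before handing out the detailed rules.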

FWIW, I allow all to access robots.txt but other than the Big 3, I block access to everything other than robots.txt, just in case. Additionally, the Big 3 are kept out of numerous directories via subdir .htaccess --

Whoa.

As I write this, I'm again whelmed by what a time-and-effort drain it is keeping the ramparts intact. An obsessively intriguing drain, but a drain nonetheless. And all of it unseen by site owners!

Anyway, getting back to AWS's bevy o' bots, Yet Another:

Pfui




msg:3850662
 3:16 pm on Feb 16, 2009 (gmt 0)

ec2-[yada-yada].compute-1.amazonaws.com
SimilarPages/Nutch-1.0-dev (SimilarPages Nutch Crawler; [similarpages.com;...] info at similarpages dot com)

robots.txt? YES

Pfui




msg:3851587
 5:07 pm on Feb 17, 2009 (gmt 0)

(So much for even bothering to use a bot-related UA...)

ec2-[yada-yada].compute-1.amazonaws.com
Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.8.1.18) Gecko/20081112 Fedora/2.0.0.18-1.fc8 Firefox/2.0.0.18

robots.txt? NO

Pfui




msg:3858337
 4:35 pm on Feb 26, 2009 (gmt 0)

(And the hits just keep on coming...)

ec2-[yada-yada].compute-1.amazonaws.com
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)

robots.txt? NO

Mokita




msg:3860792
 7:24 am on Mar 2, 2009 (gmt 0)

New one for me...

UA: goroam/1.0-SNAPSHOT (goraom geo crawler; [goroam.net...] info@goroam.net)

Requested robots.txt - Yes
Obeyed it - Yes (I white-list bots, so all but a tiny few are banned)

Came from: 67.202.0.n

Pfui




msg:3866253
 1:50 pm on Mar 9, 2009 (gmt 0)

ec2-[yada-yada].compute-1.amazonaws.com
Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)

robots.txt? NO

Pfui




msg:3866413
 4:49 pm on Mar 9, 2009 (gmt 0)

ec2-[yada-yada].compute-1.amazonaws.com
AideRSS 2.0 (postrank.com)

robots.txt? NO

GaryK




msg:3866428
 5:06 pm on Mar 9, 2009 (gmt 0)

Question: Are there any legitimate services using Amazon/EC2 that would prevent one from simply banning all Amazon/EC2 net ranges?

Samizdata




msg:3866560
 7:19 pm on Mar 9, 2009 (gmt 0)

"Are there any legitimate services using Amazon/EC2"

Legitimate, probably.

Worthwhile for webmasters, almost certainly not.

I block the whole lot and lose no sleep.

...

GaryK




msg:3866568
 7:26 pm on Mar 9, 2009 (gmt 0)

I do the same, Sam. At least I do on most of my sites. Cowbot slipped through on one of my domains last week and it's now operating from EC2. I just started a new thread about it.

phred




msg:3866677
 9:21 pm on Mar 9, 2009 (gmt 0)

Sam, Gary,

Including AMAZON-01/03, AES and NET ranges?

Cheers,
Phred

GaryK




msg:3866691
 9:32 pm on Mar 9, 2009 (gmt 0)

Yep.

[edited by: GaryK at 9:32 pm (utc) on Mar. 9, 2009]

Pfui




msg:3866788
 11:59 pm on Mar 9, 2009 (gmt 0)

ec2 -- Elastic Compute Cloud -- is just another name for server farm. And its anonymous clients are just as intrusive as those cloaked behind privatedns and secureserver and others of their ilk. (imho)

jdMorgan




msg:3866800
 12:31 am on Mar 10, 2009 (gmt 0)

Pfui,

Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.8.1.18) Gecko/20081112 Fedora/2.0.0.18-1.fc8 Firefox/2.0.0.18

could be a page thumbnailer such as that used by Ask.com search for their "preview" function in search results.


GaryK,

Alexa/Internet Archiver uses Elastic Compute Cloud services.

So you may want to allow those sub-ranges of EC2.

Jim

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved