amazonaws.com plays host to wide variety of bad bots
Most recently seen: Gnomit
Pfui

Msg#: 3828718 posted 3:04 am on Jan 18, 2009 (gmt 0)

ec2-67-202-57-30.compute-1.amazonaws.com
Mozilla/5.0 (compatible; X11; U; Linux i686 (x86_64); en-US; +http://gnomit.com/) Gecko/2008092416 Gnomit/1.0"

- robots.txt? NO
- Unbalanced quotation mark in UA (closing only)
- site in UA yields this oh-so-descriptive info:

<html>
<head>
</head>
<body>
</body>
</html>

----- ----- ----- ----- -----
FWIW, bona fide amazonaws.com hosts spewed at least 33 bots on two of my sites in recent months. (Does someone get paid per bot or something?) Some bots may be new to some of you, or newly renamed. Here are the actual UA strings, in no particular order:

NetSeer/Nutch-0.9 (NetSeer Crawler; [netseer.com;...] crawler@netseer.com)
robots.txt? YES

Mozilla/5.0 (Windows; U; Windows NT 5.1; ru; rv:1.8.0.6) Gecko/20060728 Firefox/1.5.0.6
[Note ru.]
robots.txt? NO

feedfinder/1.371 Python-urllib/1.16 +http://www.aaronsw.com/2002/feedfinder/
robots.txt? NO

Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9b4pre) Gecko/2008022910 Viewzi/0.1
robots.txt? NO

Twitturly / v0.5
robots.txt? NO

YebolBot (compatible; Mozilla/5.0; MSIE 7.0; Windows NT 6.0; rv:1.8.1.11; mailTo:thunder.chang@gmail.com)
robots.txt? NO

YebolBot (Email: yebolbot@gmail.com; If the web crawling affects your web service, or you don't like to be crawled by us, please email us. We'll stop crawling immediately.)
[Whattaya think robots.txt is for, huh?]
robots.txt? YES ... Four times in 45 minutes

Attributor/Dejan-1.0-dev (Test crawler; [attributor.com;...] info at attributor com)
robots.txt? NO

PRCrawler/Nutch-0.9 (data mining development project)
robots.txt? YES

EnaBot/1.2 (http://www.enaball.com/crawler.html)
robots.txt? YES

Nokia6680/1.0 ((4.04.07) SymbianOS/8.0 Series60/2.6 Profile/MIDP-2.0 Configuration/CLDC-1.1 (botmobi find.mobi/bot.html) )
[Note spaced-out closing parens]
robots.txt? YES

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; T312461) Java/1.5.0_09
robots.txt? NO

TheRarestParser/0.2a (http://therarestwords.com/)
robots.txt? NO

Mozilla/5.0 (compatible; D1GArabicEngine/1.0; crawlmaster@d1g.com)
robots.txt? NO

Clustera Crawler/Nutch-1.0-dev (Clustera Crawler; [crawler.clustera.com;...] cluster@clustera.com)
robots.txt? YES

Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.7) Gecko/20060909 Firefox/1.5.0.7
robots.txt? YES

yacybot (i386 Linux 2.6.16-xenU; java 1.6.0_02; America/en) [yacy.net...]
robots.txt? NO

Mozilla/5.0
robots.txt? NO

Spock Crawler (http://www.spock.com/crawler)
robots.txt? YES

TinEye
robots.txt? NO

Teemer (NetSeer, Inc. is a Los Angeles based Internet startup company.; [netseer.com...] crawler@netseer.com)
robots.txt? YES

nnn/ttt (n)
robots.txt? YES

AideRSS/1.0 (aiderss.com)
robots.txt? NO

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)
robots.txt? NO

----- ----- ----- ----- -----
These two UAs alternated multiple times one afternoon:

Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)
robots.txt? NO

WebClient
robots.txt? YES

----- ----- ----- ----- -----
And finally, way too many offerings from "Paul," who's apparently unable to make up his mind, UA name-wise:

Mozilla/5.0 (compatible; page-store) [email:paul at page-store.com
robots.txt? NO

Mozilla/5.0 (compatible; heritrix/1.12.1 +http://www.page-store.com)
robots.txt? YES

Mozilla/5.0 (compatible; heritrix/1.12.1 +http://www.page-store.com) [email:paul@page-store.com]
robots.txt? YES

Mozilla/5.0 (compatible; zermelo; +http://www.powerset.com) [email:paul@page-store.com,crawl@powerset.com]
robots.txt? NO

zermelo Mozilla/5.0 compatible; heritrix/1.12.1 (+http://www.powerset.com) [email:crawl@powerset.com,email:paul@page-store.com]
robots.txt? YES

zermelo Mozilla/5.0 compatible; heritrix/1.12.1 (+http://www.powerset.com) [email:crawl@powerset.com,email:paul@page-store.com]
robots.txt? YES

Mozilla/5.0 (compatible; zermelo; +http://www.powerset.com) [email:paul@page-store.com,crawl@powerset.com]
robots.txt? NO

-----
Slippery little suckers indeed. Thank goodness I block amazonaws.com no matter what.
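
For the record, a minimal .htaccess sketch of that kind of host-based block, assuming Apache with mod_rewrite and that REMOTE_HOST actually resolves (HostnameLookups or a double-reverse DNS setup; otherwise match on IP ranges instead). robots.txt is left reachable so blocked bots can at least see the Disallow:

# Sketch only: 403 anything resolving to *.amazonaws.com, except robots.txt.
# REMOTE_HOST is empty unless hostname lookups are enabled.
RewriteEngine On
RewriteCond %{REMOTE_HOST} \.amazonaws\.com$ [NC]
RewriteRule !^robots\.txt$ - [F]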

 

incrediBILL

Msg#: 3828718 posted 11:55 pm on Jan 20, 2009 (gmt 0)

At this point amazonaws.com is synonymous with bad activity, so blocking them becomes the de facto standard.

The only concern is what happens when the "next big thing" uses their cloud computing facility, and those of us blocking it because of all the bad-bot noise miss out on being in the Google killer until it's too late?

jdMorgan

Msg#: 3828718 posted 1:16 am on Jan 21, 2009 (gmt 0)

Pfui,

Very nice list!

But for clarity, what is the meaning of the "Robots.txt? Yes/No" entries above?

The user-agent fetches robots.txt
-or-
The user-agent fetches and obeys robots.txt

Thanks,
Jim

Pfui

Msg#: 3828718 posted 9:19 pm on Jan 29, 2009 (gmt 0)

Sorry for the belated reply. (Clicking 'Email notification of new replies' hasn't worked for me for a long time.)

-----
Jim:

"robots.txt? YES" <== Asked for it
"robots.txt? NO" <== Didn't ask

To my recollection, most that asked for robots.txt honored it. I didn't really pay attention because that's the only file I let unauthorized bots -- and amazonaws.com, the host -- access.

By the way, here's another one. If this were an eBay item, they'd get yanked for keyword-spamming the spider's brand!

Nutch/Nutch-1.0-dev (A Nutch-based crawler.; [lucene.apache.org...] nutch-agent AT lucene.apache.org)
robots.txt? YES

-----
Bill:

At this point, I don't know what I think about AWS, other than I'm amazed at (& irked by) all of its/their bots. (Plus their CloudFront info is so PC-centric that this Mac/Apache person's eyes just glaze over.) Why spider constantly, I wonder? Are they trying to rival Google? What are they doing with all of the data? Where are they going with all of their really geeky doodads?

Oh, to be a fly on the wall in Bezos's office, eh?

-----
All:

What are your thoughts on their Web Services? Are you intrigued? Have you tested anything and noticed more spidering? Their cookie systems are fantastically inbred -- notice anything new in that dept.?

What I do know is that between amazon.com and imdb.com and amazonfresh.com (fresh.amazon.com) and amazonaws.com (aws.amazon.com), they're all over the map. Literally. And their spiders are squished all over my server. Literally.

: )

[edited by: Pfui at 9:21 pm (utc) on Jan. 29, 2009]

janharders

Msg#: 3828718 posted 10:23 pm on Jan 29, 2009 (gmt 0)

it's EC2, so they're charged for traffic and computing time, right?
let them pay. if you discover they're spidering your site and you can spare the traffic, just send big, complex files. they need some form of parser, and handling big documents will slow their crawlers down and cost 'em money. if they're using a regexp to retrieve URLs to crawl, it won't be that funny, but if they're using a stack-based parser and you send them a document with thousands of nested nodes, it might get funny ;)

generally, I'm using the Amazon web services (the Alexa information thingy) and the e-commerce web service, though neither all that heavily (my bill averages $3/month). I usually don't mind bot traffic (traffic is free and my servers can easily handle the load right now) if the bot behaves. If it doesn't, I don't care whether it's an official bot from Amazon or just one that's hosted on their platform: it gets blocked, most likely at iptables, so I don't have all those 403 errors in my logs.
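
For those who block in .htaccess rather than iptables, one way to keep the 403 clutter out of the main access log is to tag the blocked requests and log them conditionally. A rough sketch, assuming Apache 2.2 with mod_setenvif; the env-variable name is arbitrary, and the CustomLog line belongs in the main server config:

# Sketch: tag amazonaws.com clients, deny them, and keep them out of the
# main access log. Remote_Host only matches when hostname lookups resolve.
SetEnvIfNoCase Remote_Host "amazonaws\.com$" aws_block
Order Allow,Deny
Allow from all
Deny from env=aws_block

# In httpd.conf -- log everything except the tagged requests:
# CustomLog logs/access_log combined env=!aws_block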

Pfui

Msg#: 3828718 posted 8:22 pm on Feb 6, 2009 (gmt 0)

Today's new ec2-[yada-yada].compute-1.amazonaws.com bot, and/or its botrunner, was more rude than most. No robots.txt, then hit 11 times in 5 visits over 4 hours (thus far). Perhaps not so coincidentally, Every. Single. Hit. was to an ALL-robot restricted page, post, or zip.

Mozilla/5.0 (compatible; kmky-not-a-bot/0.2; [kilomonkey.com...] )
robots.txt? NO

Even the bot's home page is rude ('notabot.txt' timed out): "This Server contains no useful information. The Domain name is not for sale. Goodbye."

Well, 403 back atcha, not-notabot.

(Hmm... That domain's listed to Kilo Monkey LLC, and an Aaron Flin -- of Dogpile metasearch fame?)

GaryK

Msg#: 3828718 posted 12:05 am on Feb 8, 2009 (gmt 0)

My simple solution was to ban all Amazon IP Addresses. There's a thread here somewhere with all their net ranges that someone was kind enough to post for me. Problem solved! :)
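
In .htaccess, that boils down to a stack of Deny lines, one per netblock. A sketch only; the CIDR ranges below are illustrative placeholders, so pull the current Amazon/EC2 allocations from that thread or from the registry records before relying on any of them:

# Sketch: deny example EC2-style netblocks (Apache 2.2 syntax).
# Ranges shown are illustrative only; verify current allocations yourself.
Order Allow,Deny
Allow from all
Deny from 67.202.0.0/18
Deny from 75.101.128.0/17
Deny from 174.129.0.0/16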

Pfui

Msg#: 3828718 posted 2:02 pm on Feb 9, 2009 (gmt 0)

That'd work, unless you're an Amazon associate/affiliate. Then banning all of their IPs can be problematic. From:

Amazon.com Associates Central - Help
No relevant products are showing on my page. Why is this the case?

"You must allow our spider (Mozilla/5.0 (compatible; >> AMZNKAssocBot/4.0)) to crawl your website. The crawl is needed in order to identify the content of your website and provide matching products. If you do not allow our spider to crawl your website(s) we will display selected products from our product lines in the Omakase Links."

FWIW, I've found this name works A-OK in rewrites:

Mozilla.*AMZNKAssocBot
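
So if you block amazonaws.com hosts wholesale but are also an associate, one option is to carve out an exception for that UA. A sketch, reusing the hostname condition from earlier in the thread (and trusting the UA string, which is of course forgeable):

# Sketch: keep blocking amazonaws.com clients, but let the associate
# crawler through so Omakase/product matching still works.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} !Mozilla.*AMZNKAssocBot
RewriteCond %{REMOTE_HOST} \.amazonaws\.com$ [NC]
RewriteRule !^robots\.txt$ - [F]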

Oh, and coming up, Yet Another AWS Bot. (If nothing else, this thread provides a comprehensive list of their automatons, eh? :)

Pfui

Msg#: 3828718 posted 2:04 pm on Feb 9, 2009 (gmt 0)

ec2-[yada-yada].compute-1.amazonaws.com
AISearchBot (Email: aisearchbot@gmail.com; If your web site doesn't want to be crawled, please send us a email.)

robots.txt? NO

blend27

Msg#: 3828718 posted 2:45 pm on Feb 9, 2009 (gmt 0)

-- If your web site doesn't want to be crawled, please send us a email --

Hmmm,

Let me ask it, hold on, brb.... Cool, I'm back! I asked, and it's not saying anything. I even tried the "pretty please" thingy; still not talking to me, and I am the owner. I think I'll have to learn that new telepathic computer language that enables humans to talk to text files, the one everybody's talking about, you know...

It's nice and shiny though ;)

Pfui

Msg#: 3828718 posted 7:25 pm on Feb 11, 2009 (gmt 0)

ec2-[yada-yada].compute-1.amazonaws.com
Nokia6680/1.0 (4.04.07) SymbianOS/8.0 Series60/2.6 Profile/MIDP-2.0 Configuration/CLDC-1.1 (botmobi find.mobi/bot.html find@mtld.mobi)

robots.txt? YES

Pfui

Msg#: 3828718 posted 3:25 pm on Feb 12, 2009 (gmt 0)

ec2-[yada-yada].compute-1.amazonaws.com
Mozilla/5.0 (iPhone; U; CPU like Mac OS X; en) AppleWebKit/420+ (KHTML, like Gecko) Version/3.0 Mobile/1A543a Safari/419.3 (botmobi find.mobi/bot.html find@mtld.mobi)/

robots.txt? YES

Pfui

Msg#: 3828718 posted 4:37 pm on Feb 12, 2009 (gmt 0)

Incredibly, still more newcomers courtesy of .compute-1.amazonaws.com:

UA: -
robots.txt? NO

Mozilla/5.0 (compatible; NetcraftSurveyAgent/1.0; +info@netcraft.com)
robots.txt? NO

Python-urllib/2.4
robots.txt? NO

ia_archiver (+http://www.alexa.com/site/help/webmasters; crawler@alexa.com)
robots.txt? YES

P.S./FWIW:

In addition to owning AmazonAWS.com, IMDb.com, A9.com, and who knows what else, Amazon.com owns Alexa.com, whose ia_archiver and ia_archiver-web crawlers feed Archive.org's Wayback Machine (the Internet Archive).

thetrasher

Msg#: 3828718 posted 9:54 pm on Feb 15, 2009 (gmt 0)

Nokia6680/1.0 (4.04.07) SymbianOS/8.0 Series60/2.6 Profile/MIDP-2.0 Configuration/CLDC-1.1 (botmobi find.mobi/bot.html find@mtld.mobi)

robots.txt? YES

You may write whatever you want into your robots.txt. Even

User-agent: *
Disallow: /

doesn't prevent a bot from requesting the root page.

Pfui

Msg#: 3828718 posted 11:49 pm on Feb 15, 2009 (gmt 0)

I'm not sure I understand your point, sorry.

Technically, there's nothing in robots.txt that prevents any bot from doing whatever the heck its runners program it to do.

But a blanket "Disallow: /" means Do Not Crawl Here. Go Away. Now. And that Disallow includes the root page because it's in the /rootdir. Even if the root page retrieval is basically simultaneous with that of robots.txt (as is often the case), there still should be no caching or referencing of the root page's data.

Yeah. And if wishes were horses... :)

thetrasher

Msg#: 3828718 posted 12:32 pm on Feb 16, 2009 (gmt 0)

Sorry. What I tried to write:
Nokia6680/1.0 (4.04.07) SymbianOS/8.0 Series60/2.6 Profile/MIDP-2.0 Configuration/CLDC-1.1 (botmobi find.mobi/bot.html find@mtld.mobi)
reads robots.txt, but doesn't care about the contents of robots.txt.

"To my recollection, most that asked for robots.txt honored it."
Really bad bots request robots.txt in order to find the disallowed areas and to confuse webmasters ("robots.txt? YES").

"Technically, there's nothing in robots.txt that prevents any bot from doing whatever the heck its runners program it to do."
Requesting my robots.txt leads to a site-wide ban.

Pfui

Msg#: 3828718 posted 3:16 pm on Feb 16, 2009 (gmt 0)

Thanks for clarifying! A follow-up re:

"Requesting my robots.txt leads to a site-wide ban."

I'm curious as to how you do that, and also why? There are some bots that actually honor it :)

That said, before I figured out robots.cgi, I was really reluctant to let any bot read a directory-detailed robots.txt, even a baited one. Now, the riffraff only see Disallow: /
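
For anyone wondering what a robots.cgi arrangement might look like, here's a minimal sketch: hand robots.txt requests from non-whitelisted UAs to a script and serve the real file only to the crawlers you trust. The UA names and script path are placeholders for whatever you actually whitelist; the script itself just prints a text/plain header followed by "User-agent: *" and "Disallow: /" for the riffraff.

# Sketch: whitelisted crawlers get the real robots.txt; everyone else is
# handed to a small script that prints a bare "Disallow: /".
# UA names and script path below are placeholders.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} !(Googlebot|Slurp|msnbot) [NC]
RewriteRule ^robots\.txt$ /cgi-bin/robots.cgi [L]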

FWIW, I allow everyone to access robots.txt, but apart from the Big 3 I block access to everything else, just in case. Additionally, the Big 3 are kept out of numerous directories via subdir .htaccess --

Whoa.

As I write this, I'm again whelmed by what a time-and-effort drain it is keeping the ramparts intact. An obsessively intriguing drain, but a drain nonetheless. And all of it unseen by site owners!

Anyway, getting back to AWS's bevy o' bots, Yet Another:

Pfui

Msg#: 3828718 posted 3:16 pm on Feb 16, 2009 (gmt 0)

ec2-[yada-yada].compute-1.amazonaws.com
SimilarPages/Nutch-1.0-dev (SimilarPages Nutch Crawler; [similarpages.com;...] info at similarpages dot com)

robots.txt? YES

Pfui

Msg#: 3828718 posted 5:07 pm on Feb 17, 2009 (gmt 0)

(So much for even bothering to use a bot-related UA...)

ec2-[yada-yada].compute-1.amazonaws.com
Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.8.1.18) Gecko/20081112 Fedora/2.0.0.18-1.fc8 Firefox/2.0.0.18

robots.txt? NO

Pfui

Msg#: 3828718 posted 4:35 pm on Feb 26, 2009 (gmt 0)

(And the hits just keep on coming...)

ec2-[yada-yada].compute-1.amazonaws.com
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)

robots.txt? NO

Mokita

Msg#: 3828718 posted 7:24 am on Mar 2, 2009 (gmt 0)

New one for me...

UA: goroam/1.0-SNAPSHOT (goraom geo crawler; [goroam.net...] info@goroam.net)

Requested robots.txt - Yes
Obeyed it - Yes (I white-list bots, so all but a tiny few are banned)

Came from: 67.202.0.n

Pfui

Msg#: 3828718 posted 1:50 pm on Mar 9, 2009 (gmt 0)

ec2-[yada-yada].compute-1.amazonaws.com
Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)

robots.txt? NO

Pfui

Msg#: 3828718 posted 4:49 pm on Mar 9, 2009 (gmt 0)

ec2-[yada-yada].compute-1.amazonaws.com
AideRSS 2.0 (postrank.com)

robots.txt? NO

GaryK

Msg#: 3828718 posted 5:06 pm on Mar 9, 2009 (gmt 0)

Question: Are there any legitimate services using Amazon/EC2 that would prevent one from simply banning all Amazon/EC2 net ranges?

Samizdata

Msg#: 3828718 posted 7:19 pm on Mar 9, 2009 (gmt 0)

Are there any legitimate services using Amazon/EC2

Legitimate, probably.

Worthwhile for webmasters, almost certainly not.

I block the whole lot and lose no sleep.

...

GaryK

Msg#: 3828718 posted 7:26 pm on Mar 9, 2009 (gmt 0)

I do the same, Sam. At least I do on most of my sites. Cowbot slipped through on one of my domains last week and it's now operating from EC2. I just started a new thread about it.

phred

Msg#: 3828718 posted 9:21 pm on Mar 9, 2009 (gmt 0)

Sam, Gary,

Including the AMAZON-01/03, AES and NET ranges?

Cheers,
Phred

GaryK

Msg#: 3828718 posted 9:32 pm on Mar 9, 2009 (gmt 0)

Yep.

[edited by: GaryK at 9:32 pm (utc) on Mar. 9, 2009]

Pfui

Msg#: 3828718 posted 11:59 pm on Mar 9, 2009 (gmt 0)

EC2 -- Elastic Compute Cloud -- is just another name for a server farm. And its anonymous clients are just as intrusive as those cloaked behind privatedns, secureserver, and others of their ilk. (imho)

jdMorgan

Msg#: 3828718 posted 12:31 am on Mar 10, 2009 (gmt 0)

Pfui,

Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.8.1.18) Gecko/20081112 Fedora/2.0.0.18-1.fc8 Firefox/2.0.0.18

could be a page thumbnailer such as that used by Ask.com search for their "preview" function in search results.


GaryK,

Alexa/Internet Archive uses Elastic Compute Cloud services.

So you may want to allow those sub-ranges of EC2.

Jim
