
Search Engine Spider and User Agent Identification Forum

This 278-message thread spans 10 pages; this is page 7.
amazonaws.com plays host to wide variety of bad bots
Most recently seen: Gnomit

 3:04 am on Jan 18, 2009 (gmt 0)

Mozilla/5.0 (compatible; X11; U; Linux i686 (x86_64); en-US; +http://gnomit.com/) Gecko/2008092416 Gnomit/1.0"

- robots.txt? NO
- Unbalanced quotation marks in UA (closing quote only)
- site in UA yields this oh-so-descriptive info:


----- ----- ----- ----- -----
FWIW, bona fide amazonaws.com hosts spewed at least 33 bots on two of my sites in recent months. (Does someone get paid per bot or something?) Some bots may be new to some of you, or newly renamed. Here are the actual UA strings, in no particular order:

NetSeer/Nutch-0.9 (NetSeer Crawler; [netseer.com;...] crawler@netseer.com)
robots.txt? YES

Mozilla/5.0 (Windows; U; Windows NT 5.1; ru; rv: Gecko/20060728 Firefox/
[Note ru.]
robots.txt? NO

feedfinder/1.371 Python-urllib/1.16 +http://www.aaronsw.com/2002/feedfinder/
robots.txt? NO

Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9b4pre) Gecko/2008022910 Viewzi/0.1
robots.txt? NO

Twitturly / v0.5
robots.txt? NO

YebolBot (compatible; Mozilla/5.0; MSIE 7.0; Windows NT 6.0; rv:; mailTo:thunder.chang@gmail.com)
robots.txt? NO

YebolBot (Email: yebolbot@gmail.com; If the web crawling affects your web service, or you don't like to be crawled by us, please email us. We'll stop crawling immediately.)
[Whattaya think robots.txt is for, huh?]
robots.txt? YES ... Four times in 45 minutes

Attributor/Dejan-1.0-dev (Test crawler; [attributor.com;...] info at attributor com)
robots.txt? NO

PRCrawler/Nutch-0.9 (data mining development project)
robots.txt? YES

EnaBot/1.2 (http://www.enaball.com/crawler.html)
robots.txt? YES

Nokia6680/1.0 ((4.04.07) SymbianOS/8.0 Series60/2.6 Profile/MIDP-2.0 Configuration/CLDC-1.1 (botmobi find.mobi/bot.html) )
[Note spaced-out closing parens]
robots.txt? YES

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; T312461) Java/1.5.0_09
robots.txt? NO

TheRarestParser/0.2a (http://therarestwords.com/)
robots.txt? NO

Mozilla/5.0 (compatible; D1GArabicEngine/1.0; crawlmaster@d1g.com)
robots.txt? NO

Clustera Crawler/Nutch-1.0-dev (Clustera Crawler; [crawler.clustera.com;...] cluster@clustera.com)
robots.txt? YES

Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv: Gecko/20060909 Firefox/
robots.txt? YES

yacybot (i386 Linux 2.6.16-xenU; java 1.6.0_02; America/en) [yacy.net...]
robots.txt? NO

robots.txt? NO

Spock Crawler (http://www.spock.com/crawler)
robots.txt? YES

robots.txt? NO

Teemer (NetSeer, Inc. is a Los Angeles based Internet startup company.; [netseer.com...] crawler@netseer.com)
robots.txt? YES

nnn/ttt (n)
robots.txt? YES

AideRSS/1.0 (aiderss.com)
robots.txt? NO

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)
robots.txt? NO

----- ----- ----- ----- -----
These two UAs alternated multiple times one afternoon:

Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)
robots.txt? NO

robots.txt? YES

----- ----- ----- ----- -----
And finally, way too many offerings from "Paul," who's apparently unable to make up his mind, UA name-wise:

Mozilla/5.0 (compatible; page-store) [email:paul at page-store.com
robots.txt? NO

Mozilla/5.0 (compatible; heritrix/1.12.1 +http://www.page-store.com)
robots.txt? YES

Mozilla/5.0 (compatible; heritrix/1.12.1 +http://www.page-store.com) [email:paul@page-store.com]
robots.txt? YES

Mozilla/5.0 (compatible; zermelo; +http://www.powerset.com) [email:paul@page-store.com,crawl@powerset.com]
robots.txt? NO

zermelo Mozilla/5.0 compatible; heritrix/1.12.1 (+http://www.powerset.com) [email:crawl@powerset.com,email:paul@page-store.com]
robots.txt? YES

zermelo Mozilla/5.0 compatible; heritrix/1.12.1 (+http://www.powerset.com) [email:crawl@powerset.com,email:paul@page-store.com]
robots.txt? YES

Mozilla/5.0 (compatible; zermelo; +http://www.powerset.com) [email:paul@page-store.com,crawl@powerset.com]
robots.txt? NO

Slippery little suckers indeed. Thank goodness I block amazonaws.com no matter what.



 5:20 pm on Jul 16, 2010 (gmt 0)

I get why they hide.

I don't think most are hiding.

AWS cloud services just happen to be cheap and fast, something a startup company with limited resources would find appealing.


 6:56 am on Jul 20, 2010 (gmt 0)

I too have been having problems with excessive spidering from Amazon EC2. I filed a complaint on their EC2 site for excessive spidering. I received a response back quite quickly informing me
We've determined that an Amazon EC2 instance was running at the IP address you provided in your abuse report. However, the instance has since been terminated. You shouldn't see any further abusive activity from the IP address you submitted.

Thanks again for alerting us to this issue

But the abuse continued and I further complained. I ended up with a lengthy email more or less telling me that they are not responsible for what bots their customers run and I should modify my robots.txt to restrict crawlers, robots, spiders and other automated paging indexers.
If, once your site has a modified robots.txt file in place, web crawls continue from EC2 address space, please file another abuse form. Failure to follow robots.txt is a violation of the EC2 Terms of Use, and will be handled as Abuse Instance (or abuse report).

I delved a little further in order to ban all these IP addresses and eventually found a post on their forum:
Dear Amazon EC2 customers,

We are pleased to announce that as part of our ongoing expansion, we have added a new public IP range. The current Amazon EC2 public address ranges are:

US East (Northern Virginia): [IP ranges elided]

US West (Northern California): [IP ranges elided]

EU (Ireland): [IP ranges elided]


The Amazon EC2 Team

I will try and ban the IP ranges, so hopefully this will stop it. Information on banning IP ranges can be found here [webmasterworld.com...]
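The range-ban approach can be sketched in a few lines of Python (a hypothetical illustration using the stdlib `ipaddress` module; the CIDR ranges below are RFC 5737 documentation placeholders, not Amazon's actual ranges, which Amazon announces and updates):

```python
import ipaddress

# Hypothetical blocklist. These are RFC 5737 documentation ranges,
# NOT real EC2 ranges -- substitute the ranges Amazon announces.
BLOCKED_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
    ipaddress.ip_network("203.0.113.0/24"),
]

def is_blocked(ip_string):
    """True if the address falls inside any banned CIDR range."""
    ip = ipaddress.ip_address(ip_string)
    return any(ip in net for net in BLOCKED_RANGES)
```

Running each request's remote address through a check like this is the script-level equivalent of range-based Deny directives in .htaccess.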


 5:51 am on Jul 22, 2010 (gmt 0)

A quick update: I banned the IPs and so far, so good. I have gone from 1,000+ pages spidered to 0.

This is the htaccess which I used:

Deny from
Deny from
Deny from
Deny from
Deny from
Deny from
Deny from
Deny from
Deny from
Deny from
Deny from
Deny from


 1:16 pm on Aug 3, 2010 (gmt 0)


robots.txt? NO


 1:56 am on Aug 6, 2010 (gmt 0)

bitmagicbot/0.1 admin@bitmagic.in

robots.txt? NO


 2:11 am on Aug 6, 2010 (gmt 0)

Okay, this one I don't get. No UA at all? Usually if it's blank, my apache logs at least show a hyphen, similar to no referer. But here, see for yourself. UAs appear in the last set of quotes.

Curiouser: This last quote-set isn't even a hyphen:

ec2-184-73-234-197.compute-1.amazonaws.com - - [05/Aug/2010:06:46:52 -0700] "GET / HTTP/1.1" 403 1468 "-" ""

And Curiouser: This last quote-set is... gone!

ec2-184-73-234-197.compute-1.amazonaws.com - - [05/Aug/2010:06:51:37 -0700] "GET /dir HTTP/1.1" 403 1468 "-"

Those were the only hits -- almost 5 minutes apart and no robots.txt either time ('natch). Oh, and that /dir is verboten to almost all bots. Interesting how it made a beeline right for it.


 2:51 pm on Aug 6, 2010 (gmt 0)

I don't know, but here's some info:

The usual behavior of Apache log-processing is to replace completely-blank Referer or User-Agent headers with a single hyphen, which it then shows in quotes as it does all the other "fields" in the log entry. So if either header is truly-blank, then we see a quoted hyphen in the logs. These hyphens are added at the "log-entry-text-generation" level, and do not exist in the internal HTTP_REFERER or HTTP_USER_AGENT server variables. Basically, the hyphen only exists in the text of the log file, and not in the real headers or internal server variables.

Some bad bots actually send a hyphen in the Referer or User-Agent header. This is to bypass simple access-control rules (such as those often used in .htaccess and security scripts) which refuse requests with either or both of these headers blank. That is why some webmasters choose to return a 403 Forbidden response to requests with blank Referer or User-Agent headers, but call a script to ban the IP address if either or both headers actually contain a hyphen.
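The two-tier policy described above (403 for truly blank headers, IP-ban for a literal hyphen) can be sketched as decision logic; this is a hypothetical Python illustration, with `classify_request` and its header-dict argument being my own invention, not anyone's actual script:

```python
def classify_request(headers):
    """Classify a request by its Referer/User-Agent headers.

    Returns:
      'forbid' -- a header is truly absent or blank (403 the request)
      'ban'    -- a header contains a literal hyphen, a tell-tale of a
                  bot trying to slip past blank-header checks (ban the IP)
      'ok'     -- otherwise
    """
    for name in ("Referer", "User-Agent"):
        value = headers.get(name, "")
        if value == "-":   # real hyphen sent on the wire
            return "ban"
        if value == "":    # header missing or empty
            return "forbid"
    return "ok"
```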

The most common cause of truncated or garbled log entries is a fault in a log-parsing script. These scripts are used by many shared-hosting hosts to "sort out" the log entries for the whole server (one big log file for all customers' sites) into separate log files for each customer. One way to tell if you have this kind of set-up is that if there is a script like this in use, then your access and user-agent logs will only update once an hour or once a day instead of being updated in real time. Some hosts similarly 'delay' the error log, while others don't.

So in this case, there are a few possibilities. One is that the log-sorting script is defective and is dropping characters -- either due to a logic error, mis-implementation of I/O buffering, or other errors due to server resource overload.

The other --and this is highly-speculative-- is this user-agent is sending backspace-space-backspace sequences in the user-agent header, which has the visual effect of "erasing" characters on your screen. Non-printing characters like backspace *should not* be passed-through by the log-sorting scripts, but again, this might be a scripting error. Alternately, perhaps the User-Agent header contains multiple quotes itself, triggering a bug in the log-sorting or log-entry-generation code (I know that single occurrences of a quote character in the actual user-agent header are logged as expected but I'm not sure about multiples, as I've never tested it.)
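As an illustration of how fragile that log-entry text is, here's a naive Python parser for combined-format log lines; it extracts an empty-string agent from the quoted entry earlier in this thread, and a stray quote inside the User-Agent field defeats it entirely. (The regex and its behavior are my own sketch, not a known script from any host.)

```python
import re

# Naive parser for Apache "combined" log lines. A quote character
# inside the User-Agent field defeats this regex -- one way a
# log-sorting script could emit truncated or garbled entries.
LOG_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

def parse(line):
    """Return a dict of log fields, or None if the line doesn't parse."""
    m = LOG_RE.match(line)
    return m.groupdict() if m else None

line = ('ec2-184-73-234-197.compute-1.amazonaws.com - - '
        '[05/Aug/2010:06:46:52 -0700] "GET / HTTP/1.1" 403 1468 "-" ""')
```

Here `parse(line)["agent"]` comes back as the empty string (an empty quote-set in the log), while a UA containing an embedded quote makes the whole line unparseable.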

More information might be inferred from looking at your access-control rules, and seeing which one(s) were likely responsible for your (successful) rejection of these requests. However, if these requests were immediately blocked by IP address (as I expect, knowing your 'deep feelings' for AmazonAWS) before any Referer or User-Agent header inspection, then there's not much you can learn by doing this.

Long way of saying -- "No idea, actually." :)



 4:23 pm on Aug 6, 2010 (gmt 0)

Thanks for your thoughts, Jim! The server's dedicated/colo so there's no log parsing script per se, other than Apache's extended/combined logging via httpd.conf prefs. (One script I do use is a tail so I still see the raw log.) In short, those coded lines were the real, raw deal.

And rewrite-wise, I use your (thanks:) --

## BLOCK *Faked* blank referer -OR- UA (malicious agents supply a literal hyphen as UA string)
RewriteCond %{HTTP_REFERER}<->%{HTTP_USER_AGENT} ^<->$
RewriteRule .* - [F]

But I just stumbled onto which UA is involved with the oddities...

Now as to whether or not its code is nefarious or just plain lousy, I can't say. (Its conduct is definitely lousy.) But grepping through this and last month's logs, the ONLY hits from the exact same AmazonAWS address --


-- are ALL malformed, UA-wise... until one brand-new bot suddenly makes an appearance in the "UA" spot:

[03/Aug/2010:12:58:31 -0700] "GET / HTTP/1.1" 403 1468 "-" ""
[04/Aug/2010:04:10:08 -0700] "GET / HTTP/1.1" 403 1468 "-" ""
[04/Aug/2010:06:46:52 -0700] "GET / HTTP/1.1" 403 1468 "-" ""
[04/Aug/2010:06:51:37 -0700] "GET /dir HTTP/1.1" 403 1468 "-" ""
[04/Aug/2010:10:24:00 -0700] "GET /fileA.html HTTP/1.1" 403 1468 "-" ""
[04/Aug/2010:12:12:16 -0700] "GET /fileB.html HTTP/1.1" 403 1468 "-" ""
[05/Aug/2010:12:18:08 -0700] "GET /FileB.html HTTP/1.1" 403 1468 "-" "bitmagicbot/0.1 admin@bitmagic.in"

I reported bitmagicbot yesterday in its own thread [webmasterworld.com] so I'll repost the month's curious hit list there. (Why? Because if nothing else, poor Gary and everyone else who gets notifications of changes to this thread will be spared a lot more all of a sudden:)

Somewhere along the copy-paste way, I goofed re the missing quote-set oddity, sorry. None was ever completely AWOL.


 5:44 am on Aug 15, 2010 (gmt 0)

FWIW from the Jerks File...
UA = Ruby/1.8.7; previously: Ruby EventMachine [webmasterworld.com]
20 hits in 7 secs; no robots.txt; double-hits = alternated HEADs, GETs. All 403'd.

08/14 19:25:17 /dir1/fileA.html
08/14 19:25:17 /dir2/fileB.html
08/14 19:25:18 /dir1/fileA.html
08/14 19:25:18 /dir2/fileB.html

08/14 19:25:14 /dir1/fileC.html
08/14 19:25:16 /dir1/fileC.html
08/14 19:25:16 /dir1/fileD.html
08/14 19:25:17 /dir1/fileE.html
08/14 19:25:18 /dir1/fileD.html
08/14 19:25:18 /dir1/fileE.html

08/14 19:25:11 /dir3/fileF.html
08/14 19:25:11 /dir2/fileG.html
08/14 19:25:12 /dir3/fileF.html
08/14 19:25:12 /dir2/fileG.html
08/14 19:25:13 /
08/14 19:25:14 /
08/14 19:25:16 /dir1/fileH.html
08/14 19:25:16 /dir1/fileG.html
08/14 19:25:17 /dir1/fileH.html
08/14 19:25:18 /dir1/fileG.html
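A burst like that (20 hits in 7 seconds) is straightforward to flag in a log-watching script. Here's a minimal sliding-window sketch in Python; the threshold and window values are arbitrary assumptions, not anything from this thread:

```python
from collections import deque

class BurstDetector:
    """Flag an IP that exceeds max_hits requests within `window` seconds."""

    def __init__(self, max_hits=15, window=10.0):
        self.max_hits = max_hits
        self.window = window
        self.hits = {}  # ip -> deque of request timestamps

    def hit(self, ip, now):
        """Record one request; return True if the IP is now bursting."""
        q = self.hits.setdefault(ip, deque())
        q.append(now)
        # Discard timestamps that have aged out of the window.
        while q and now - q[0] > self.window:
            q.popleft()
        return len(q) > self.max_hits
```

Feed it (ip, timestamp) pairs as log lines arrive; once an IP trips the threshold, hand it off to whatever does the banning.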


 5:25 pm on Aug 17, 2010 (gmt 0)

Silverton/1.0 Crawler

robots.txt? NO

Previously... [webmasterworld.com...]


 5:44 am on Aug 18, 2010 (gmt 0)

Twitter swarmer:

Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2) Gecko/20100115 Firefox/3.6 (+http://flipboard.com/crawler)

robots.txt? NO

HEAD then GET to same html doc; 2 hits in 2 secs.


 2:49 am on Aug 24, 2010 (gmt 0)

I can only hope this one's a fake...

Googlebot/2.1 (+http://www.google.com/bot.html)

robots.txt? NO


 9:20 am on Aug 27, 2010 (gmt 0)

Host: ec2-184-73-90-164.compute-1.amazonaws.com
Referer: -
Agent: Mozilla/5.0 (compatible; page-store) [email:paul at page-store.com]

robots.txt? NO


 3:20 pm on Aug 27, 2010 (gmt 0)

paul at page-store

"Paul" has been on AMAZON-EC2-3 for a while and has a very special place in my .htaccess file where only a few make it, for at least 2.5 years.

----- ----- ----- ----- -----
And finally, way too many offerings from "Paul," who's apparently unable to make up his mind, UA name-wise


 8:19 pm on Sep 2, 2010 (gmt 0)

Came across another AWS block today: - - Ireland (Amazon EU)

Full Amazon range (all of it blocked here) is 46.51.128.0/17


 9:37 pm on Sep 2, 2010 (gmt 0)

Thanks dstiles


 7:03 am on Sep 11, 2010 (gmt 0)

Just spotted this crawling a link (twice, apparently) to one of my sites. Don't care much for the Host, of course. But am also less than happy to see the UA's "ru" localization: We have way, waaay too many problems with most Things Russian. Then again, it could just be a fake UA --

Mozilla/5.0 (Windows; U; Windows NT 6.1; ru; rv: Gecko/20100401 Firefox/4.0 (.NET CLR 3.5.30729)
09/10 22:41:39 /
09/10 22:43:36 /

robots.txt? NO


 9:17 pm on Sep 11, 2010 (gmt 0)

Firefox 4? Still on 3.6.9 here.


 8:53 am on Sep 14, 2010 (gmt 0)

SockrollBot/.01 (+http://www.sockroll.com/roll/SockrollBot)

robots.txt? NO


 9:00 pm on Sep 22, 2010 (gmt 0)

(Note: I've munged the URLs in the UAs by adding a space and removing dots.)

UA: LucidMedia ClickSense/4.0 (support@lucidmedia.com; http:// www lucidmedia com/)

IP: 174.129.118.nnn
rDNS: ec2-174-129-118-nnn.compute-1.amazonaws.com
robots.txt: Yes
obeys robots.txt: To be determined.
Purpose: their page doesn't explain, but it's probably something to do with advertising and click-throughs. Not sure why that would get me a visit, as I don't advertise.

Old Thread: [webmasterworld.com...]

Project Honeypot:
Geographic Location: United States
Spider First Seen: approximately 1 year, 6 months, 1 week ago
Spider Last Seen: within 1 year, 6 months, 1 week
Spider Sightings: 1 visit(s)
User-Agents: seen with 1 user-agent(s)
Threat Rating: (Not yet scored. Check again soon.)

User Agent Strings
SimilarPages/Nutch-1.0-dev (SimilarPages Nutch Crawler; http:// www similarpages com; info at similarpages dot com)


 2:18 pm on Sep 27, 2010 (gmt 0)

Back to the Stone Age with this UA:

Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 4.0)

robots.txt? NO


 9:12 pm on Sep 27, 2010 (gmt 0)

@Pfui, if you really want Stone Age, try these:
Mozilla/0.6 Beta (Windows)
Mozilla/0.91 Beta (Windows)
Mozilla/1.22 (compatible; MSIE 2.0; Windows 95)
Mozilla/2.0 (compatible; MSIE 3.02; Windows CE; 240x320)
Mozilla/3.0 (compatible; WebCapture 2.0; Auto; Windows)

This is from a bot that likes to scan my guest book looking for a page to post comments. Those are just a few of the roughly 103 fake UAs this bot likes to use.


 4:31 pm on Oct 23, 2010 (gmt 0)

This all-too-real nasty app came by a bit ago:

Wget/1.11.4 Red Hat modified

robots.txt? NO


 7:10 pm on Oct 26, 2010 (gmt 0)


robots.txt? NO


 9:15 pm on Oct 27, 2010 (gmt 0)

Just discovered another Amazon IP range, allocated 2010-10-07: -

"The activity ... from a dynamic hosting environment."


 5:48 am on Nov 1, 2010 (gmt 0)

Confirmed, just saw Netcraft using it.

NetName: AMAZON-EC2-8


 6:45 am on Nov 1, 2010 (gmt 0)

Thanks dstiles, caribguy


 6:21 pm on Nov 1, 2010 (gmt 0)

FWIW: Yet another instance of an AWS server running with/compromised by a botnet. Turkey this time: [projecthoneypot.org...] The first hit of each pair was self-referred to the same bot-verboten .cgi file, the second to root.

Opera/7.60 (Windows NT 5.2; U) [en] (IBM EVV/3.0/EAK01AG9/LE)
11/01 09:50:08
11/01 09:50:15

Opera/7.60 (Windows NT 5.2; U) [en] (IBM EVV/3.0/EAK01AG9/LE)
11/01 09:50:55
11/01 09:50:56

Opera/7.60 (Windows NT 5.2; U) [en] (IBM EVV/3.0/EAK01AG9/LE)
11/01 09:51:43
11/01 09:51:54

So much for cloud security. Again.


 8:57 pm on Nov 6, 2010 (gmt 0)

Just discovered an AWS IP trying to get through using a CoreIX proxy.

The AWS Irish range - 125.127.255 is blocked in IIS - i.e. I never see those IPs in my logs because they are blocked before they can hit a web site.

The IP attempted some hits today using as a proxy, hitting three times using IE and firefox UAs about 30 seconds apart.

There was another identical (single) proxy hit a few days ago to a UK trap domain using an IE UA beginning "IE 7 x Mozilla 4/0..." where x was a UTF "D"-ish character.

There is no log of such an access last month using CoreIX. I haven't gone back before that.


 11:38 pm on Nov 8, 2010 (gmt 0)

UA: Mozilla/5.0 (compatible; NetcraftSurveyAgent/1.0; +info@netcraft.com)
IP: 50.16.85.nnn
rDNS: ec2-50-16-85-nnn.compute-1.amazonaws.com.

Robots.txt: No

Their web site's title reads "Anti-Phishing and PCI Security Services".
The site doesn't mention NetcraftSurveyAgent that I could find.


 2:43 am on Nov 9, 2010 (gmt 0)

NetcraftSurveyAgent has been around for a few years at least, originally hailing from lager.netcraft.com using the same UA and typically HEAD-requesting:


That path's above the webspace, but the file is accessible (and the URI not easily blocked, imho) because the file's an Apache OS image. I think it's really, really sneaky that Netcraft probes that way.

Also, Netcraft's been bot-running from AWS since, oh, early this year. Regardless of host, it never asks for robots.txt. Then again, the two robots.txt files they use (search their site for: robots.txt) are syntactically incorrect/ineffectual.

[edited by: Pfui at 2:45 am (utc) on Nov 9, 2010]
