Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

This 278 message thread spans 10 pages: < < 278 ( 1 2 3 4 5 6 [7] 8 9 10 > >     
amazonaws.com plays host to wide variety of bad bots
Most recently seen: Gnomit
Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 3828718 posted 3:04 am on Jan 18, 2009 (gmt 0)

ec2-67-202-57-30.compute-1.amazonaws.com
Mozilla/5.0 (compatible; X11; U; Linux i686 (x86_64); en-US; +http://gnomit.com/) Gecko/2008092416 Gnomit/1.0"

- robots.txt? NO
- Unmatched quotation mark in UA (closing only)
- site in UA yields this oh-so-descriptive info:

<html>
<head>
</head>
<body>
</body>
</html>

----- ----- ----- ----- -----
FWIW, bona fide amazonaws.com hosts spewed at least 33 bots on two of my sites in recent months. (Does someone get paid per bot or something?) Some bots may be new to some of you, or newly renamed. Here are the actual UA strings, in no particular order:

NetSeer/Nutch-0.9 (NetSeer Crawler; [netseer.com;...] crawler@netseer.com)
robots.txt? YES

Mozilla/5.0 (Windows; U; Windows NT 5.1; ru; rv:1.8.0.6) Gecko/20060728 Firefox/1.5.0.6
[Note ru.]
robots.txt? NO

feedfinder/1.371 Python-urllib/1.16 +http://www.aaronsw.com/2002/feedfinder/
robots.txt? NO

Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9b4pre) Gecko/2008022910 Viewzi/0.1
robots.txt? NO

Twitturly / v0.5
robots.txt? NO

YebolBot (compatible; Mozilla/5.0; MSIE 7.0; Windows NT 6.0; rv:1.8.1.11; mailTo:thunder.chang@gmail.com)
robots.txt? NO

YebolBot (Email: yebolbot@gmail.com; If the web crawling affects your web service, or you don't like to be crawled by us, please email us. We'll stop crawling immediately.)
[Whattaya think robots.txt is for, huh?]
robots.txt? YES ... Four times in 45 minutes

Attributor/Dejan-1.0-dev (Test crawler; [attributor.com;...] info at attributor com)
robots.txt? NO

PRCrawler/Nutch-0.9 (data mining development project)
robots.txt? YES

EnaBot/1.2 (http://www.enaball.com/crawler.html)
robots.txt? YES

Nokia6680/1.0 ((4.04.07) SymbianOS/8.0 Series60/2.6 Profile/MIDP-2.0 Configuration/CLDC-1.1 (botmobi find.mobi/bot.html) )
[Note spaced-out closing parens]
robots.txt? YES

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; T312461) Java/1.5.0_09
robots.txt? NO

TheRarestParser/0.2a (http://therarestwords.com/)
robots.txt? NO

Mozilla/5.0 (compatible; D1GArabicEngine/1.0; crawlmaster@d1g.com)
robots.txt? NO

Clustera Crawler/Nutch-1.0-dev (Clustera Crawler; [crawler.clustera.com;...] cluster@clustera.com)
robots.txt? YES

Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.7) Gecko/20060909 Firefox/1.5.0.7
robots.txt? YES

yacybot (i386 Linux 2.6.16-xenU; java 1.6.0_02; America/en) [yacy.net...]
robots.txt? NO

Mozilla/5.0
robots.txt? NO

Spock Crawler (http://www.spock.com/crawler)
robots.txt? YES

TinEye
robots.txt? NO

Teemer (NetSeer, Inc. is a Los Angeles based Internet startup company.; [netseer.com...] crawler@netseer.com)
robots.txt? YES

nnn/ttt (n)
robots.txt? YES

AideRSS/1.0 (aiderss.com)
robots.txt? NO

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)
robots.txt? NO

----- ----- ----- ----- -----
These two UAs alternated multiple times one afternoon:

Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)
robots.txt? NO

WebClient
robots.txt? YES

----- ----- ----- ----- -----
And finally, way too many offerings from "Paul," who's apparently unable to make up his mind, UA name-wise:

Mozilla/5.0 (compatible; page-store) [email:paul at page-store.com
robots.txt? NO

Mozilla/5.0 (compatible; heritrix/1.12.1 +http://www.page-store.com)
robots.txt? YES

Mozilla/5.0 (compatible; heritrix/1.12.1 +http://www.page-store.com) [email:paul@page-store.com]
robots.txt? YES

Mozilla/5.0 (compatible; zermelo; +http://www.powerset.com) [email:paul@page-store.com,crawl@powerset.com]
robots.txt? NO

zermelo Mozilla/5.0 compatible; heritrix/1.12.1 (+http://www.powerset.com) [email:crawl@powerset.com,email:paul@page-store.com]
robots.txt? YES

zermelo Mozilla/5.0 compatible; heritrix/1.12.1 (+http://www.powerset.com) [email:crawl@powerset.com,email:paul@page-store.com]
robots.txt? YES

Mozilla/5.0 (compatible; zermelo; +http://www.powerset.com) [email:paul@page-store.com,crawl@powerset.com]
robots.txt? NO

-----
Slippery little suckers indeed. Thank goodness I block amazonaws.com no matter what.

 

incrediBILL

WebmasterWorld Administrator, WebmasterWorld Top Contributor of All Time, 5+ Year Member, Top Contributors of the Month



 
Msg#: 3828718 posted 5:20 pm on Jul 16, 2010 (gmt 0)

I get why they hide.


I don't think most are hiding.

AWS cloud services just happen to be cheap and fast, something a startup company with limited resources would find appealing.

affiliation

5+ Year Member



 
Msg#: 3828718 posted 6:56 am on Jul 20, 2010 (gmt 0)

I too have been having problems with excessive spidering from Amazon EC2. I filed a complaint on their EC2 site and received a response back quite quickly, informing me:
We've determined that an Amazon EC2 instance was running at the IP address you provided in your abuse report. However, the instance has since been terminated. You shouldn't see any further abusive activity from the IP address you submitted.

Thanks again for alerting us to this issue

But the abuse continued and I complained further. I ended up with a lengthy email more or less telling me that they are not responsible for what bots their customers run, and that I should modify my robots.txt to restrict crawlers, robots, spiders, and other automated page indexers.
If, once your site has a modified robots.txt file in place, web crawls continue from EC2 address space, please file another abuse form. Failure to follow robots.txt is a violation of the EC2 Terms of Use, and will be handled as Abuse Instance (or abuse report).


I delved a little further in order to ban all these IP addresses, and eventually found a post on their forum:
Dear Amazon EC2 customers,

We are pleased to announce that as part of our ongoing expansion, we have added a new public IP range. The current Amazon EC2 public address ranges are:

US East (Northern Virginia):

216.182.224.0/20 (216.182.224.0 - 216.182.239.255)
72.44.32.0/19 (72.44.32.0 - 72.44.63.255)
67.202.0.0/18 (67.202.0.0 - 67.202.63.255)
75.101.128.0/17 (75.101.128.0 - 75.101.255.255)
174.129.0.0/16 (174.129.0.0 - 174.129.255.255)
204.236.192.0/18 (204.236.192.0 - 204.236.255.255) [previously 204.236.224.0/19]

US West (Northern California):

204.236.128.0/18 (204.236.128.0 - 204.236.191.255)

EU (Ireland):

79.125.0.0/17 (79.125.0.0 - 79.125.127.255)

Sincerely,

The Amazon EC2 Team


I will try and ban the IP ranges, so hopefully this will stop it. Information on banning IP ranges can be found here: [webmasterworld.com...]

affiliation

5+ Year Member



 
Msg#: 3828718 posted 5:51 am on Jul 22, 2010 (gmt 0)

A quick update: I banned the IPs, and so far so good; I have gone from 1,000+ pages spidered to 0.

This is the .htaccess which I used:

Deny from 216.182.224.0/20
Deny from 72.44.32.0/19
Deny from 67.202.0.0/18
Deny from 75.101.128.0/17
Deny from 174.129.0.0/16
Deny from 204.236.192.0/18
Deny from 79.125.0.0/17
Deny from 184.72.0.0/18
Deny from 184.73.0.0/16
Deny from 175.41.128.0/18
Deny from 184.72.128.0/17
Deny from 204.236.128.0/18
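For anyone doing the same check outside of .htaccess (log analysis, a security script, etc.), here's a minimal sketch using Python's stdlib ipaddress module with the same CIDRs as the Deny list above. Amazon keeps adding ranges, so treat the list as a snapshot, not a complete block list:

```python
import ipaddress

# Same EC2 CIDRs as the .htaccess Deny list above; this will drift
# out of date as Amazon announces new ranges.
EC2_RANGES = [ipaddress.ip_network(c) for c in (
    "216.182.224.0/20", "72.44.32.0/19", "67.202.0.0/18",
    "75.101.128.0/17", "174.129.0.0/16", "204.236.192.0/18",
    "79.125.0.0/17", "184.72.0.0/18", "184.73.0.0/16",
    "175.41.128.0/18", "184.72.128.0/17", "204.236.128.0/18",
)]

def in_ec2(ip):
    """True if the address falls inside any of the listed EC2 blocks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in EC2_RANGES)

print(in_ec2("174.129.21.61"))  # an ec2-*.amazonaws.com host from this thread -> True
print(in_ec2("8.8.8.8"))        # -> False
```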

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 3828718 posted 1:16 pm on Aug 3, 2010 (gmt 0)

ec2-174-129-21-61.compute-1.amazonaws.com
Crawler

robots.txt? NO

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 3828718 posted 1:56 am on Aug 6, 2010 (gmt 0)

ec2-184-73-234-197.compute-1.amazonaws.com
bitmagicbot/0.1 admin@bitmagic.in

robots.txt? NO

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 3828718 posted 2:11 am on Aug 6, 2010 (gmt 0)

Okay, this one I don't get. No UA at all? Usually if it's blank, my apache logs at least show a hyphen, similar to no referer. But here, see for yourself. UAs appear in the last set of quotes.

Curiouser: This last quote-set isn't even a hyphen:

ec2-184-73-234-197.compute-1.amazonaws.com - - [05/Aug/2010:06:46:52 -0700] "GET / HTTP/1.1" 403 1468 "-" ""

And Curiouser: This last quote-set is... gone!

ec2-184-73-234-197.compute-1.amazonaws.com - - [05/Aug/2010:06:51:37 -0700] "GET /dir HTTP/1.1" 403 1468 "-"

Those were the only hits -- almost 5 minutes apart and no robots.txt either time ('natch). Oh, and that /dir is verboten to almost all bots. Interesting how it made a beeline right for it.

jdMorgan

WebmasterWorld Senior Member, WebmasterWorld Top Contributor of All Time, 10+ Year Member



 
Msg#: 3828718 posted 2:51 pm on Aug 6, 2010 (gmt 0)

I don't know, but here's some info:

The usual behavior of Apache log-processing is to replace completely-blank Referer or User-Agent headers with a single hyphen, which it then shows in quotes as it does all the other "fields" in the log entry. So if either header is truly-blank, then we see a quoted hyphen in the logs. These hyphens are added at the "log-entry-text-generation" level, and do not exist in the internal HTTP_REFERER or HTTP_USER_AGENT server variables. Basically, the hyphen only exists in the text of the log file, and not in the real headers or internal server variables.
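To make that concrete, here's a minimal Python parse of the combined-format line quoted earlier in the thread. The empty User-Agent field comes through as an empty string, while the logger's hyphen substitution for the blank Referer shows up as a literal "-":

```python
import re

# Combined Log Format; quoted fields captured verbatim
LOG_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

line = ('ec2-184-73-234-197.compute-1.amazonaws.com - - '
        '[05/Aug/2010:06:46:52 -0700] "GET / HTTP/1.1" 403 1468 "-" ""')

m = LOG_RE.match(line)
print(m.group("agent") == "")     # truly empty UA field -> True
print(m.group("referer") == "-")  # hyphen substituted at log-writing time -> True
```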

Some bad-bots actually send a hyphen in the Referer or User-Agent header. This is to by-pass simple access control rules (such as those often used in .htaccess and security scripts) which refuse requests with either or both of these headers blank. That is why some Webmasters choose to return a 403-Forbidden response to requests with blank Referer or User-Agent headers, but call a script to ban the IP address if either or both headers actually contain a hyphen.

The most common cause of truncated or garbled log entries is a fault in a log-parsing script. These scripts are used by many shared-hosting hosts to "sort out" the log entries for the whole server (one big log file for all customers' sites) into separate log files for each customer. One way to tell if you have this kind of set-up is that if there is a script like this in use, then your access and user-agent logs will only update once an hour or once a day instead of being updated in real time. Some hosts similarly 'delay' the error log, while others don't.

So in this case, there are a few possibilities. One is that the log-sorting script is defective and is dropping characters -- either due to a logic error, mis-implementation of I/O buffering, or other errors due to server resource overload.

The other --and this is highly-speculative-- is this user-agent is sending backspace-space-backspace sequences in the user-agent header, which has the visual effect of "erasing" characters on your screen. Non-printing characters like backspace *should not* be passed-through by the log-sorting scripts, but again, this might be a scripting error. Alternately, perhaps the User-Agent header contains multiple quotes itself, triggering a bug in the log-sorting or log-entry-generation code (I know that single occurrences of a quote character in the actual user-agent header are logged as expected but I'm not sure about multiples, as I've never tested it.)

More information might be inferred from looking at your access-control rules, and seeing which one(s) were likely responsible for your (successful) rejection of these requests. However, if these requests were immediately blocked by IP address (as I expect, knowing your 'deep feelings' for AmazonAWS) before any Referer or User-Agent header inspection, then there's not much you can learn by doing this.

Long way of saying -- "No idea, actually." :)

Jim

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 3828718 posted 4:23 pm on Aug 6, 2010 (gmt 0)

Thanks for your thoughts, Jim! The server's dedicated/colo so there's no log parsing script per se, other than Apache's extended/combined logging via httpd.conf prefs. (One script I do use is a tail so I still see the raw log.) In short, those coded lines were the real, raw deal.

And rewrite-wise, I use your (thanks:) --

## BLOCK *Faked* blank referer -OR- UA (malicious agents supply a literal hyphen as UA string)
RewriteCond %{HTTP_REFERER}<->%{HTTP_USER_AGENT} ^<->$
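Reading that rule literally: the TestString glues Referer and User-Agent around a literal <-> marker, and ^<->$ matches only when both headers are truly empty; a literal hyphen in either header fails it, so hyphen-senders have to be caught separately. A Python sketch of the same combined-string trick:

```python
import re

# Mimics the RewriteCond above: join the two headers around a "<->"
# marker, then match only the bare marker (i.e. both headers blank).
BLANK_BOTH = re.compile(r"^<->$")

def blocked(referer, user_agent):
    return bool(BLANK_BOTH.match(f"{referer}<->{user_agent}"))

print(blocked("", ""))             # both headers blank -> True
print(blocked("-", "-"))           # literal hyphens do NOT match this rule -> False
print(blocked("", "Mozilla/5.0"))  # UA present -> False
```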

But I just stumbled onto which UA is involved with the oddities...

Now as to whether or not its code is nefarious or just plain lousy, I can't say. (Its conduct is definitely lousy.) But grepping through this and last month's logs, the ONLY hits from the exact same AmazonAWS address --

ec2-184-73-234-197.compute-1.amazonaws.com

-- are ALL malformed, UA-wise... until one brand-new bot suddenly makes an appearance in the "UA" spot:

[03/Aug/2010:12:58:31 -0700] "GET / HTTP/1.1" 403 1468 "-" ""
[04/Aug/2010:04:10:08 -0700] "GET / HTTP/1.1" 403 1468 "-" ""
[04/Aug/2010:06:46:52 -0700] "GET / HTTP/1.1" 403 1468 "-" ""
[04/Aug/2010:06:51:37 -0700] "GET /dir HTTP/1.1" 403 1468 "-" ""
[04/Aug/2010:10:24:00 -0700] "GET /fileA.html HTTP/1.1" 403 1468 "-" ""
[04/Aug/2010:12:12:16 -0700] "GET /fileB.html HTTP/1.1" 403 1468 "-" ""
[05/Aug/2010:12:18:08 -0700] "GET /FileB.html HTTP/1.1" 403 1468 "-" "bitmagicbot/0.1 admin@bitmagic.in"

I reported bitmagicbot yesterday in its own thread [webmasterworld.com] so I'll repost the month's curious hit list there. (Why? Because if nothing else, poor Gary and everyone else who gets notifications of changes to this thread will be spared a lot more all of a sudden:)

P.S.
Somewhere along the copy-paste way, I goofed re the missing quote-set oddity, sorry. None was ever completely AWOL.

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 3828718 posted 5:44 am on Aug 15, 2010 (gmt 0)

FWIW from the Jerks File...
UA = Ruby/1.8.7; previously: Ruby EventMachine [webmasterworld.com]
20 hits in 7 secs; no robots.txt; double-hits = alternated HEADs, GETs. All 403'd.

ec2-75-101-231-140.compute-1.amazonaws.com
08/14 19:25:17 /dir1/fileA.html
08/14 19:25:17 /dir2/fileB.html
08/14 19:25:18 /dir1/fileA.html
08/14 19:25:18 /dir2/fileB.html

ec2-72-44-58-113.compute-1.amazonaws.com
08/14 19:25:14 /dir1/fileC.html
08/14 19:25:16 /dir1/fileC.html
08/14 19:25:16 /dir1/fileD.html
08/14 19:25:17 /dir1/fileE.html
08/14 19:25:18 /dir1/fileD.html
08/14 19:25:18 /dir1/fileE.html

ec2-67-202-33-142.compute-1.amazonaws.com
08/14 19:25:11 /dir3/fileF.html
08/14 19:25:11 /dir2/fileG.html
08/14 19:25:12 /dir3/fileF.html
08/14 19:25:12 /dir2/fileG.html
08/14 19:25:13 /
08/14 19:25:14 /
08/14 19:25:16 /dir1/fileH.html
08/14 19:25:16 /dir1/fileG.html
08/14 19:25:17 /dir1/fileH.html
08/14 19:25:18 /dir1/fileG.html
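Bursts like the above (20 hits in 7 seconds across rotating EC2 hosts) are easy to flag automatically. A rough sliding-window sketch, with illustrative thresholds rather than anything canonical:

```python
from collections import defaultdict, deque

# Sliding-window burst detector: flag an IP that exceeds `limit` hits
# within `window` seconds. Thresholds here are made up for illustration.
def make_detector(limit=10, window=7.0):
    hits = defaultdict(deque)
    def record(ip, ts):
        q = hits[ip]
        q.append(ts)
        while q and ts - q[0] > window:  # drop hits older than the window
            q.popleft()
        return len(q) > limit            # True -> swarm-like burst
    return record

record = make_detector(limit=10, window=7.0)
flagged = False
for t in range(20):  # 20 hits, 0.3s apart -- well inside the window
    flagged = record("67.202.33.142", t * 0.3) or flagged
print(flagged)  # True
```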

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 3828718 posted 5:25 pm on Aug 17, 2010 (gmt 0)

ec2-184-72-246-182.compute-1.amazonaws.com
Silverton/1.0 Crawler

robots.txt? NO

Previously... [webmasterworld.com...]

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 3828718 posted 5:44 am on Aug 18, 2010 (gmt 0)

Twitter swarmer:

ec2-174-129-104-82.compute-1.amazonaws.com
Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2) Gecko/20100115 Firefox/3.6 (+http://flipboard.com/crawler)

robots.txt? NO

HEAD then GET to same html doc; 2 hits in 2 secs.

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 3828718 posted 2:49 am on Aug 24, 2010 (gmt 0)

I can only hope this one's a fake...

ec2-184-72-17-65.us-west-1.compute.amazonaws.com
Googlebot/2.1 (+http://www.google.com/bot.html)

robots.txt? NO
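Easy enough to check: Google documents that a genuine Googlebot's reverse DNS ends in googlebot.com or google.com and forward-resolves back to the same IP (forward-confirmed rDNS). A sketch; the live-DNS part obviously needs network access, but the hostname above fails the very first test:

```python
import socket

def claims_google(hostname):
    """True if the rDNS name is in a Google-owned domain per their docs."""
    return hostname.rstrip(".").endswith((".googlebot.com", ".google.com"))

def verify_googlebot(ip):
    """Forward-confirmed reverse DNS: rDNS must be a Google domain AND
    resolve back to the same IP. Requires live DNS lookups."""
    try:
        host = socket.gethostbyaddr(ip)[0]
    except socket.herror:
        return False
    if not claims_google(host):
        return False
    return ip in socket.gethostbyname_ex(host)[2]

print(claims_google("ec2-184-72-17-65.us-west-1.compute.amazonaws.com"))  # False
```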

MxAngel



 
Msg#: 3828718 posted 9:20 am on Aug 27, 2010 (gmt 0)

Host: ec2-184-73-90-164.compute-1.amazonaws.com
Referer: -
Agent: Mozilla/5.0 (compatible; page-store) [email:paul at page-store.com]


robots.txt? NO

blend27

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 3828718 posted 3:20 pm on Aug 27, 2010 (gmt 0)

paul at page-store


"Paul" has been on AMAZON-EC2-3 for a while, and for at least 2.5 years has had a very special place in my .htaccess file where only a few make it.

----- ----- ----- ----- -----
And finally, way too many offerings from "Paul," who's apparently unable to make up his mind, UA name-wise

dstiles

WebmasterWorld Senior Member, WebmasterWorld Top Contributor of All Time, 5+ Year Member



 
Msg#: 3828718 posted 8:19 pm on Sep 2, 2010 (gmt 0)

Came across another AWS block today:

46.51.128.0 - 46.51.207.255 - Ireland (Amazon EU)

The full Amazon range (all of it blocked here) is 46.51.128.0/17

keyplyr

WebmasterWorld Senior Member, WebmasterWorld Top Contributor of All Time, 10+ Year Member, Top Contributors of the Month



 
Msg#: 3828718 posted 9:37 pm on Sep 2, 2010 (gmt 0)

Thanks dstiles

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 3828718 posted 7:03 am on Sep 11, 2010 (gmt 0)

Just spotted this crawling a link (twice, apparently) to one of my sites. Don't care much for the Host, of course. But am also less than happy to see the UA's "ru" localization: We have way, waaay too many problems with most Things Russian. Then again, it could just be a fake UA --

ec2-184-73-108-177.compute-1.amazonaws.com
Mozilla/5.0 (Windows; U; Windows NT 6.1; ru; rv:1.9.2.3) Gecko/20100401 Firefox/4.0 (.NET CLR 3.5.30729)
09/10 22:41:39 /
09/10 22:43:36 /

robots.txt? NO

dstiles

WebmasterWorld Senior Member, WebmasterWorld Top Contributor of All Time, 5+ Year Member



 
Msg#: 3828718 posted 9:17 pm on Sep 11, 2010 (gmt 0)

Firefox 4? Still on 3.6.9 here.

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 3828718 posted 8:53 am on Sep 14, 2010 (gmt 0)

ec2-72-44-63-135.compute-1.amazonaws.com
SockrollBot/.01 (+http://www.sockroll.com/roll/SockrollBot)

robots.txt? NO

Dijkgraaf

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 3828718 posted 9:00 pm on Sep 22, 2010 (gmt 0)

(Note: I've munged the URLs in the UAs by adding spaces and removing dots.)

UA: LucidMedia ClickSense/4.0 (support@lucidmedia.com; http:// www lucidmedia com/)

IP: 174.129.118.nnn
rDNS: ec2-174-129-118-nnn.compute-1.amazonaws.com
robots.txt: Yes
obeys robots.txt: To be determined.
Purpose: Their page doesn't explain, but it's probably something to do with advertising and click-throughs. Not sure why that would get me a visit, as I don't advertise.

Old Thread: [webmasterworld.com...]

Project Honeypot
Geographic Location [United States] United States
Spider First Seen approximately 1 year, 6 months, 1 week ago
Spider Last Seen within 1 year, 6 months, 1 week
Spider Sightings 1 visit(s)
User-Agents seen with 1 user-agent(s)
Threat Rating (Not yet scored. Check again soon.)

User Agent Strings
SimilarPages/Nutch-1.0-dev (SimilarPages Nutch Crawler; http:// www similarpages com; info at similarpages dot com)

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 3828718 posted 2:18 pm on Sep 27, 2010 (gmt 0)

Back to the Stone Age with this UA:

ec2-174-129-88-35.compute-1.amazonaws.com
Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 4.0)

robots.txt? NO

Dijkgraaf

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 3828718 posted 9:12 pm on Sep 27, 2010 (gmt 0)

@Pfui, if you really want Stone Age, try these:
Mozilla/0.6 Beta (Windows)
Mozilla/0.91 Beta (Windows)
Mozilla/1.22 (compatible; MSIE 2.0; Windows 95)
Mozilla/2.0 (compatible; MSIE 3.02; Windows CE; 240x320)
Mozilla/3.0 (compatible; WebCapture 2.0; Auto; Windows)

This is from a bot that likes to scan my guest book looking for a page to post comments on. Those are just a few of the roughly 103 fake UAs this bot likes to use.

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 3828718 posted 4:31 pm on Oct 23, 2010 (gmt 0)

This all-too-real nasty app came by a bit ago:

ec2-184-73-182-114.compute-1.amazonaws.com
Wget/1.11.4 Red Hat modified

robots.txt? NO

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 3828718 posted 7:10 pm on Oct 26, 2010 (gmt 0)

ec2-75-101-222-156.compute-1.amazonaws.com
pzncrawl/1.63

robots.txt? NO

dstiles

WebmasterWorld Senior Member, WebmasterWorld Top Contributor of All Time, 5+ Year Member



 
Msg#: 3828718 posted 9:15 pm on Oct 27, 2010 (gmt 0)

Just discovered another Amazon IP range, allocated 2010-10-07:

50.16.0.0 - 50.19.255.255

"The activity ... from a dynamic hosting environment."

caribguy

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 3828718 posted 5:48 am on Nov 1, 2010 (gmt 0)

Confirmed, just saw Netcraft using it.

CIDR: 50.16.0.0/14
NetName: AMAZON-EC2-8

keyplyr

WebmasterWorld Senior Member, WebmasterWorld Top Contributor of All Time, 10+ Year Member, Top Contributors of the Month



 
Msg#: 3828718 posted 6:45 am on Nov 1, 2010 (gmt 0)

Thanks dstiles, caribguy

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 3828718 posted 6:21 pm on Nov 1, 2010 (gmt 0)

FWIW: Yet another instance of an AWS server running with/compromised by a botnet. Turkey this time: [projecthoneypot.org...] The first hit of each pair was self-referred to the same bot-verboten .cgi file, the second to root.

88.255.97.nn
Opera/7.60 (Windows NT 5.2; U) [en] (IBM EVV/3.0/EAK01AG9/LE)
11/01 09:50:08
11/01 09:50:15

ec2-174-129-33-212.compute-1.amazonaws.com
Opera/7.60 (Windows NT 5.2; U) [en] (IBM EVV/3.0/EAK01AG9/LE)
11/01 09:50:55
11/01 09:50:56

88.255.97.nn
Opera/7.60 (Windows NT 5.2; U) [en] (IBM EVV/3.0/EAK01AG9/LE)
11/01 09:51:43
11/01 09:51:54

So much for cloud security. Again.

dstiles

WebmasterWorld Senior Member, WebmasterWorld Top Contributor of All Time, 5+ Year Member



 
Msg#: 3828718 posted 8:57 pm on Nov 6, 2010 (gmt 0)

Just discovered an AWS IP trying to get through using a CoreIX proxy.

The AWS Irish range 79.125.0.0 - 79.125.127.255 is blocked in IIS, i.e. I never see those IPs in my logs 'cause they are blocked before they can hit a web site.

The IP 79.125.106.203 attempted some hits today using 85.13.201.226 as a proxy, hitting three times about 30 seconds apart using IE and Firefox UAs.

There was another identical (single) proxy hit a few days ago to a UK trap domain using an IE UA beginning "IE 7 x Mozilla 4/0..." where x was a UTF "D"-ish character.

There is no log of such an access last month using CoreIX. I haven't gone back before that.

Dijkgraaf

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 3828718 posted 11:38 pm on Nov 8, 2010 (gmt 0)

UA: Mozilla/5.0 (compatible; NetcraftSurveyAgent/1.0; +info@netcraft.com)
IP: 50.16.85.nnn
rDNS: ec2-50-16-85-nnn.compute-1.amazonaws.com.

Robots.txt: No

According to their web site title, they offer "Anti-Phishing and PCI Security Services".
Their website doesn't mention NetcraftSurveyAgent anywhere that I could find.

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 3828718 posted 2:43 am on Nov 9, 2010 (gmt 0)

NetcraftSurveyAgent has been around for a few years at least, originally hailing from lager.netcraft.com using the same UA and typically HEAD-requesting:

/icons/apache_pb.gif

That path's above the webspace, but the file is accessible (and the URI not easily blocked, imho) because the file's a stock Apache image. I think it's really, really sneaky that Netcraft probes that way.

Also, Netcraft's been bot-running from AWS since, oh, early this year. Regardless of host, it never asks for robots.txt. Then again, the two robots.txt files they use (search their site for: robots.txt) are syntactically incorrect/ineffectual.
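If you want to see whether a robots.txt actually does what its author thinks it does, Python's stdlib parser makes a quick offline check. The rules below are hypothetical, for illustration only (not Netcraft's actual files):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, parsed offline (no fetching involved)
rules = """\
User-agent: *
Disallow: /dir
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("SomeBot/1.0", "http://example.com/dir/page.html"))  # False
print(rp.can_fetch("SomeBot/1.0", "http://example.com/index.html"))     # True
```

A file that parses to "allow everything" for every agent is a quick tell that it's syntactically broken or ineffectual.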

[edited by: Pfui at 2:45 am (utc) on Nov 9, 2010]


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
Home ¦ Free Tools ¦ Terms of Service ¦ Privacy Policy ¦ Report Problem ¦ About ¦ Library ¦ Newsletter
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved