Search Engine Spider and User Agent Identification Forum

amazonaws.com plays host to wide variety of bad bots
Most recently seen: Gnomit
Pfui




msg:3828720
 3:04 am on Jan 18, 2009 (gmt 0)

ec2-67-202-57-30.compute-1.amazonaws.com
Mozilla/5.0 (compatible; X11; U; Linux i686 (x86_64); en-US; +http://gnomit.com/) Gecko/2008092416 Gnomit/1.0"

- robots.txt? NO
- Unbalanced quotation mark in UA (closing only)
- site in UA yields this oh-so-descriptive info:

<html>
<head>
</head>
<body>
</body>
</html>

----- ----- ----- ----- -----
FWIW, bona fide amazonaws.com hosts spewed at least 33 bots on two of my sites in recent months. (Does someone get paid per bot or something?) Some bots may be new to some of you; or newly renamed. Here are the actual UA strings; in no particular order:

NetSeer/Nutch-0.9 (NetSeer Crawler; [netseer.com;...] crawler@netseer.com)
robots.txt? YES

Mozilla/5.0 (Windows; U; Windows NT 5.1; ru; rv:1.8.0.6) Gecko/20060728 Firefox/1.5.0.6
[Note ru.]
robots.txt? NO

feedfinder/1.371 Python-urllib/1.16 +http://www.aaronsw.com/2002/feedfinder/
robots.txt? NO

Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9b4pre) Gecko/2008022910 Viewzi/0.1
robots.txt? NO

Twitturly / v0.5
robots.txt? NO

YebolBot (compatible; Mozilla/5.0; MSIE 7.0; Windows NT 6.0; rv:1.8.1.11; mailTo:thunder.chang@gmail.com)
robots.txt? NO

YebolBot (Email: yebolbot@gmail.com; If the web crawling affects your web service, or you don't like to be crawled by us, please email us. We'll stop crawling immediately.)
[Whattaya think robots.txt is for, huh?]
robots.txt? YES ... Four times in 45 minutes

Attributor/Dejan-1.0-dev (Test crawler; [attributor.com;...] info at attributor com)
robots.txt? NO

PRCrawler/Nutch-0.9 (data mining development project)
robots.txt? YES

EnaBot/1.2 (http://www.enaball.com/crawler.html)
robots.txt? YES

Nokia6680/1.0 ((4.04.07) SymbianOS/8.0 Series60/2.6 Profile/MIDP-2.0 Configuration/CLDC-1.1 (botmobi find.mobi/bot.html) )
[Note spaced-out closing parens]
robots.txt? YES

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; T312461) Java/1.5.0_09
robots.txt? NO

TheRarestParser/0.2a (http://therarestwords.com/)
robots.txt? NO

Mozilla/5.0 (compatible; D1GArabicEngine/1.0; crawlmaster@d1g.com)
robots.txt? NO

Clustera Crawler/Nutch-1.0-dev (Clustera Crawler; [crawler.clustera.com;...] cluster@clustera.com)
robots.txt? YES

Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.7) Gecko/20060909 Firefox/1.5.0.7
robots.txt? YES

yacybot (i386 Linux 2.6.16-xenU; java 1.6.0_02; America/en) [yacy.net...]
robots.txt? NO

Mozilla/5.0
robots.txt? NO

Spock Crawler (http://www.spock.com/crawler)
robots.txt? YES

TinEye
robots.txt? NO

Teemer (NetSeer, Inc. is a Los Angeles based Internet startup company.; [netseer.com...] crawler@netseer.com)
robots.txt? YES

nnn/ttt (n)
robots.txt? YES

AideRSS/1.0 (aiderss.com)
robots.txt? NO

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)
robots.txt? NO

----- ----- ----- ----- -----
These two UAs alternated multiple times one afternoon:

Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)
robots.txt? NO

WebClient
robots.txt? YES

----- ----- ----- ----- -----
And finally, way too many offerings from "Paul," who's apparently unable to make up his mind, UA name-wise:

Mozilla/5.0 (compatible; page-store) [email:paul at page-store.com
robots.txt? NO

Mozilla/5.0 (compatible; heritrix/1.12.1 +http://www.page-store.com)
robots.txt? YES

Mozilla/5.0 (compatible; heritrix/1.12.1 +http://www.page-store.com) [email:paul@page-store.com]
robots.txt? YES

Mozilla/5.0 (compatible; zermelo; +http://www.powerset.com) [email:paul@page-store.com,crawl@powerset.com]
robots.txt? NO

zermelo Mozilla/5.0 compatible; heritrix/1.12.1 (+http://www.powerset.com) [email:crawl@powerset.com,email:paul@page-store.com]
robots.txt? YES

zermelo Mozilla/5.0 compatible; heritrix/1.12.1 (+http://www.powerset.com) [email:crawl@powerset.com,email:paul@page-store.com]
robots.txt? YES

Mozilla/5.0 (compatible; zermelo; +http://www.powerset.com) [email:paul@page-store.com,crawl@powerset.com]
robots.txt? NO

-----
Slippery little suckers indeed. Thank goodness I block amazonaws.com no matter what.

 

keyplyr




msg:4325713
 12:03 am on Jun 14, 2011 (gmt 0)

@ dstiles Just an FYI concerning the IP range config: 174.129/16

That does not work on all unix/apache set-ups. On 2 of the 3 hosted servers I use, I must write it like this: 174.129.0.0/16.

Just thought I'd post this for those who mistakenly cut'n paste without doing their research.

And agreed, good to see you around again, pfui.

tangor




msg:4325798
 6:39 am on Jun 14, 2011 (gmt 0)

pfui! Dear Heart, happy to see you posting again! Missed you.

wilderness




msg:4325911
 1:30 pm on Jun 14, 2011 (gmt 0)

keyplyr,
Regarding "174.129.0.0/16" ?

FWIW, all that trailing crap is not necessary for entire Class Groups (in this example a Class B).

174.129. will function the same as 174.129.0.0/16
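
A minimal sketch, assuming Python's standard ipaddress module, of why the notations discussed here describe the same block. The helper name to_network is hypothetical; it only pads missing octets and infers the prefix length, and it says nothing about which forms a given Apache build will accept.

```python
import ipaddress


def to_network(spec: str) -> ipaddress.IPv4Network:
    """Expand '174.129.', '174.129/16' or '174.129.0.0/16' to a full network."""
    spec = spec.strip()
    if "/" in spec:
        addr, _, prefix = spec.partition("/")
        prefix_len = int(prefix)
        octets = [o for o in addr.split(".") if o]
    else:
        # A bare partial address: each octet given fixes 8 bits of the prefix.
        octets = [o for o in spec.split(".") if o]
        prefix_len = 8 * len(octets)
    # Pad the missing trailing octets with zeros, e.g. "174.129" -> "174.129.0.0".
    octets += ["0"] * (4 - len(octets))
    return ipaddress.IPv4Network(f"{'.'.join(octets)}/{prefix_len}")


if __name__ == "__main__":
    for form in ("174.129.", "174.129/16", "174.129.0.0/16"):
        print(form, "->", to_network(form))
```

All three inputs normalize to the same network object, which is one way to check a shorthand entry before pasting it into a config that wants the full form.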

dstiles




msg:4326140
 9:19 pm on Jun 14, 2011 (gmt 0)

I was not saying what the "code" should be, keyplyr, as I don't use .htaccess. I was just reporting the IP class. My own system requires a full range, e.g. 174.129.0.0 - 174.129.255.255. Tedious but not as bad as you might suppose. One day I may automate data entry to deal with /16 and such. :)
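
For systems that want a full start-end range, the /16-style notation can be expanded mechanically. This is only a sketch using Python's standard ipaddress module; the function name cidr_to_range is made up for illustration.

```python
import ipaddress
from typing import Tuple


def cidr_to_range(cidr: str) -> Tuple[str, str]:
    """Return the first and last address covered by a CIDR block."""
    net = ipaddress.ip_network(cidr)
    # net[0] is the network address, net[-1] the last address in the block.
    return str(net[0]), str(net[-1])


if __name__ == "__main__":
    first, last = cidr_to_range("174.129.0.0/16")
    print(f"{first} - {last}")
```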

keyplyr




msg:4326279
 7:22 am on Jun 15, 2011 (gmt 0)

174.129. will function the same as 174.129.0.0/16 - wilderness

Thanks, I'm aware of that. The reason I write the full code is so I know the breadth of the block, especially when the host later splits the range to include new companies. Not so obvious in the above examples, but in more specific blocks very helpful.

Pfui




msg:4340406
 2:07 am on Jul 17, 2011 (gmt 0)

ec2-174-129-79-111.compute-1.amazonaws.com
HTTP_Request2/2.0.0RC1 (http://pear.php.net/package/http_request2) PHP/5.3.2-1ubuntu4.9

robots.txt? NO

Pfui




msg:4340407
 2:21 am on Jul 17, 2011 (gmt 0)

ec2-174-129-171-219.compute-1.amazonaws.com
PostPost/1.0 (+http://postpo.st/crawlers)

robots.txt? Yes

Cutesy URL TLD .st = São Tomé and Príncipe. Twitter-traveler. More info in "PostPost" thread.

Pfui




msg:4340583
 8:39 pm on Jul 17, 2011 (gmt 0)

And now, from the "Twitter-Swarmer Still Won't Take "No" for an Answer" department:

In 25 seconds, the following three hosts used two different-version UAs to hit the same file four times (& botbait three times) via GET and HEAD and all despite 302s and 403s:

ec2-50-16-177-215.compute-1.amazonaws.com
ec2-107-20-14-9.compute-1.amazonaws.com
ec2-174-129-57-129.compute-1.amazonaws.com

Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2) Gecko/20100115 Firefox/3.6 (FlipboardProxy/0.0.5; +http://flipboard.com/browserproxy)
Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2) Gecko/20100115 Firefox/3.6 (FlipboardProxy/1.1; +http://flipboard.com/browserproxy)

Oh, almost forgot...

All that came after successfully getting -- and getting fully Disallowed in -- robots.txt in the first second. Sheesh, if you ain't gonna heed it, why bother to read it?

FWIW, I chide flipboard.com and AWS for such shoddy, erm, webmanship. (Read: "What jerks.")

See also, from Aug. 17th of last year: "flipboard" [webmasterworld.com...]

Pfui




msg:4340590
 9:05 pm on Jul 17, 2011 (gmt 0)

Repeat offender Twitter-swarmer.

ec2-184-73-218-115.compute-1.amazonaws.com
Strawberryj.am

robots.txt? NO

Another cutesy, and poorly coded, UA. TLD .am = Armenia. See also "Strawberryj.am" thread.

dstiles




msg:4342315
 5:00 pm on Jul 21, 2011 (gmt 0)

New Amazon range to block, first seen today:

107.20.0.0 - 107.23.255.255

Initial access on 107.20.15.169 with no UA. A good start to a new IP range.

Neither my Linux Network Tools Whois nor robtex could resolve the IP range, although it was registered in March this year. Nice to know the internet and its tools are up to date. :(

ARIN's DNS for the range says the following, for anyone wanting to report Amazon's many badly-behaved bots...

---------------------
The activity you have detected originates from a dynamic hosting environment.
For fastest response, please submit abuse reports at [aws-portal.amazon.com...]
For more information regarding EC2 see:
[ec2.amazonaws.com...]
All reports MUST include:
* src IP
* dest IP (your IP)
* dest port
* Accurate date/timestamp and timezone of activity
* Intensity/frequency (short log extracts)
* Your contact details (phone and email) Without these we will be unable to identify the correct owner of the IP address at that point in time.
---------------------

keyplyr




msg:4342622
 8:38 am on Jul 22, 2011 (gmt 0)

Thanks dstiles

Pfui




msg:4343833
 3:04 am on Jul 26, 2011 (gmt 0)

ec2-184-72-178-50.compute-1.amazonaws.com
Mozilla/5.0 (compatible; Bender; http://benderthewebrobot.tumblr.com)
robots.txt? Yes

See also: Bender the web crawler [webmasterworld.com...]

dstiles




msg:4344206
 8:03 pm on Jul 26, 2011 (gmt 0)

I wouldn't ever see that one. 184.72/15 is blocked in the IIS "firewall".

Pfui




msg:4344822
 4:20 am on Jul 28, 2011 (gmt 0)

ec2-46-137-129-41.eu-west-1.compute.amazonaws.com
Zend_Http_Client
robots.txt? NO

See also just-started "Zend_Http_Client" thread.

keyplyr




msg:4344905
 9:33 am on Jul 28, 2011 (gmt 0)

RE: ec2-46-137-129-41.eu-west-1.compute.amazonaws.com

I've had 46.137.0.0/16 blocked but have been unsure if there's a more specific range. Can't find much info. Anyone?

See also just-started "Zend_Http_Client" thread.

Looked, didn't find it.

Pfui




msg:4344984
 2:09 pm on Jul 28, 2011 (gmt 0)

I started a separate "Zend_Http_Client" [webmasterworld.com...] thread right before I posted in this one. As of this writing, the standalone's still pending mod approval.

dstiles




msg:4345176
 8:47 pm on Jul 28, 2011 (gmt 0)

keyplyr - I blocked the whole /16 - it's AWS in Ireland.

keyplyr




msg:4345287
 1:20 am on Jul 29, 2011 (gmt 0)

Thanks dstiles

dstiles




msg:4345591
 9:22 pm on Jul 29, 2011 (gmt 0)

From the zdnet security blog today...

"Amazon's cloud services systematically exploited by cybercriminals

"Security researchers from Kaspersky Labs have spotted yet another SpyEye crimeware variant using Amazon's Simple Storage Service (Amazon S3) for command and control purposes.

"...Does crimeware in the cloud have a future? Most certainly..."

Pfui




msg:4348343
 6:14 pm on Aug 5, 2011 (gmt 0)

I don't know which is worse, that this UA is fake, or real...

ec2-184-73-173-106.compute-1.amazonaws.com
ia_archiver

So much for reading and heeding robots.txt...

08/05 nn:34:50 /robots.txt 200
08/05 nn:34:50 /sitemap.xml 403
08/05 nn:34:50 /sitemap_index.xml 403
08/05 nn:34:51 /sitemap.xml.gz 403
08/05 nn:34:51 /sitemap_index.xml.gz 403
08/05 nn:34:52 /sitemap.txt 403
08/05 nn:34:52 /sitemap.rss 403
08/05 nn:34:53 /sitemap.atom 403
08/05 nn:34:53 / 403

Jerks.

keyplyr




msg:4348424
 9:56 pm on Aug 5, 2011 (gmt 0)

Several new IP Ranges are listed on the AWS blog itself: https://forums.aws.amazon.com/ann.jspa?annID=1097

Pfui




msg:4348447
 11:40 pm on Aug 5, 2011 (gmt 0)

1.) For folks denying the IP cesspool that is AWS, I hope you're also using --

deny from amazonaws

-- and/or mod_rewriting amazonaws because AWS does not include ALL of their 'current public address ranges' on keyplyr's handy link. For example:

ec2-50-17-0-111.compute-1.amazonaws.com
a.k.a. Comment Spammer 50.17.0.111 [projecthoneypot.org...]
a.k.a. AWS Net Range 50.16.0.0 - 50.19.255.255

If you don't do reverse IP lookups...
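
A rough sketch of the hostname side of this, in Python. It assumes the rDNS names follow the ec2-A-B-C-D pattern seen throughout this thread; the function names are hypothetical, and reverse_lookup simply wraps the standard socket.gethostbyaddr call (it needs live DNS, so it is not exercised in the demo at the bottom).

```python
import re
import socket
from typing import Optional

# Matches names such as ec2-50-17-0-111.compute-1.amazonaws.com
EC2_HOST = re.compile(
    r"^ec2-(\d{1,3})-(\d{1,3})-(\d{1,3})-(\d{1,3})"
    r"\.(?:[a-z0-9-]+\.)*amazonaws\.com$"
)


def is_aws_host(hostname: str) -> bool:
    """True if the name is amazonaws.com or any subdomain of it."""
    host = hostname.lower().rstrip(".")
    return host == "amazonaws.com" or host.endswith(".amazonaws.com")


def ip_from_ec2_host(hostname: str) -> Optional[str]:
    """Recover the dotted IP embedded in an EC2-style rDNS name, if any."""
    match = EC2_HOST.match(hostname.lower().rstrip("."))
    return ".".join(match.groups()) if match else None


def reverse_lookup(ip: str) -> Optional[str]:
    """Reverse-resolve an IP; returns None when no PTR record is found."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except (socket.herror, socket.gaierror):
        return None


if __name__ == "__main__":
    name = "ec2-50-17-0-111.compute-1.amazonaws.com"
    print(is_aws_host(name), ip_from_ec2_host(name))
```

Matching on the hostname catches AWS addresses that are missing from a hand-maintained IP list, at the cost of a DNS lookup per visitor.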

2.) The following list -- numerical, so I can easily eyeball entries in .htaccess -- consists of the IPs in AWS's 07-29-11 geographically-arrayed announcement, minus 50.17.0.0/16 (& unknown others of AWS's gazillion IPs). You might want to add 50.17.0.0/16, or even better, consolidate all the 50s into:

deny from 50.16.0.0/14

## 07-29-11: Amazon EC2 Public IP Ranges
## https://forums.aws.amazon.com/ann.jspa?annID=1097
deny from 46.51.128.0/18
deny from 46.51.192.0/20
deny from 46.51.216.0/21
deny from 46.51.224.0/19
deny from 46.137.0.0/17
deny from 46.137.128.0/18
deny from 46.137.224.0/19
deny from 50.16.0.0/15
deny from 50.18.0.0/16
deny from 50.19.0.0/16
deny from 67.202.0.0/18
deny from 72.44.32.0/19
deny from 75.101.128.0/17
deny from 79.125.0.0/17
deny from 103.4.8.0/21
deny from 107.20.0.0/15
deny from 122.248.192.0/18
deny from 174.129.0.0/16
deny from 175.41.128.0/18
deny from 175.41.192.0/18
deny from 176.32.64.0/19
deny from 176.34.128.0/17
deny from 184.72.0.0/18
deny from 184.72.64.0/18
deny from 184.72.128.0/17
deny from 184.73.0.0/16
deny from 204.236.128.0/18
deny from 204.236.192.0/18
deny from 216.182.224.0/20
##
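
The consolidation of the 50s suggested above can be checked mechanically. A small sketch with Python's standard ipaddress module follows; the helper names collapse_rules and is_blocked are invented for the example, and only the three 50.x entries from the list are used as input.

```python
import ipaddress
from typing import Iterable, List


def collapse_rules(cidrs: Iterable[str]) -> List[ipaddress.IPv4Network]:
    """Merge adjacent or overlapping CIDR blocks into the fewest networks."""
    nets = [ipaddress.ip_network(c) for c in cidrs]
    return list(ipaddress.collapse_addresses(nets))


def is_blocked(ip: str, nets: Iterable[ipaddress.IPv4Network]) -> bool:
    """True if the address falls inside any of the given networks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in nets)


if __name__ == "__main__":
    # The three 50.x entries as they appear in the list.
    # Note: 50.16.0.0/15 covers 50.16.0.0 - 50.17.255.255.
    fifties = ["50.16.0.0/15", "50.18.0.0/16", "50.19.0.0/16"]
    for net in collapse_rules(fifties):
        print(f"deny from {net}")
```

Running the whole list through the same helper is a quick way to spot entries that can be merged before they go into .htaccess.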

dstiles




msg:4348719
 9:52 pm on Aug 6, 2011 (gmt 0)

I have a slightly different list which excluded 103 (only recently released for use) and 176, both of which I have now added - many thanks!

Total list, including ranges shown in DNS as Amazon (not AWS)...

8.18.144.0 - 8.18.145.255
46.51.128.0 - 46.51.255.255
46.137.0.0 - 46.137.255.255
50.16.0.0 - 50.19.255.255
67.202.0.0 - 67.202.63.255
72.44.32.0 - 72.44.63.255
75.101.128.0 - 75.101.255.255
79.125.0.0 - 79.125.127.255
87.238.80.0 - 87.238.87.255
103.4.8.0 - 103.4.15.255
107.20.0.0 - 107.23.255.255
122.248.192.0 - 122.248.255.255
174.129.0.0 - 174.129.255.255
175.41.128.0 - 175.41.255.255
176.32.64.0 - 176.32.127.255
176.34.128.0 - 176.34.255.255
184.72.0.0 - 184.73.255.255
199.255.192.0 - 199.255.195.255
204.236.128.0 - 204.236.255.255
207.171.160.0 - 207.171.191.255
216.182.224.0 - 216.182.239.255

Pfui




msg:4351053
 3:30 am on Aug 13, 2011 (gmt 0)

Among other not-okay things, note the 400 error (bad request/syntax):

ec2-204-236-194-99.compute-1.amazonaws.com - - [1n/Aug/2011:12:34:56 -0700] "HEAD HTTP/1.1" 400 - "-" "-"
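
One plausible reading of that 400: the request line has a method and a protocol version but no target path. A hedged sketch in Python that pulls the request out of a common/combined-format log line and checks for the usual three parts; the regex and the function names are assumptions for illustration, not anything the server itself runs.

```python
import re
from typing import Optional

# host ident user [time] "request" status size ...
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
)


def extract_request(line: str) -> Optional[str]:
    """Return the quoted request line from a log entry, or None."""
    match = LOG_LINE.match(line)
    return match.group("request") if match else None


def request_line_ok(request: str) -> bool:
    """A well-formed request line is: METHOD target HTTP/version."""
    parts = request.split()
    return len(parts) == 3 and parts[2].startswith("HTTP/")


if __name__ == "__main__":
    entry = ('ec2-204-236-194-99.compute-1.amazonaws.com - - '
             '[1n/Aug/2011:12:34:56 -0700] "HEAD HTTP/1.1" 400 - "-" "-"')
    request = extract_request(entry)
    print(request, request_line_ok(request))
```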

Pfui




msg:4353161
 4:44 pm on Aug 18, 2011 (gmt 0)

ec2-184-73-8-96.compute-1.amazonaws.com
AlexionResearchBot/Nutch-1.3

robots.txt? Yes

Pfui




msg:4353644
 11:48 pm on Aug 19, 2011 (gmt 0)

Twitter swarmer/reader/whatever.

ec2-175-41-196-238.ap-northeast-1.compute.amazonaws.com
Crowsnest/0.5 (+http://www.crowsnest.tv/)

robots.txt? NO

See also: Crowsnest [webmasterworld.com...]

dstiles




msg:4354865
 2:36 pm on Aug 24, 2011 (gmt 0)

Another IP range detected today...

72.21.192.0 - 72.21.223.255

The IP 72.21.217.n was used as a proxy for a UA of MSIE-6, which is in itself highly deprecated. Headers were consistent with either a battened-down proxy or a bot.

The Forwarded-For IP was 208.53.158.nnn which is an IP belonging to FDC Servers - already banned because it... um... er... servers?

208.53.128.0 - 208.53.191.255 (and others)

Pfui




msg:4354958
 7:31 pm on Aug 24, 2011 (gmt 0)

A quick scan of my notes for that 72.21.217. shows that last June, when Amazon-owned IMDb confirmed a site-related entry, this combo hit root:

72.21.217.0
Mozilla/4.0

Then in July, another confirmation, and a neighboring IP plus a bad UA --

72.21.217.64
libwww-perl/5.805

(Problem is, I never know what Amazon/IMDb/AWS is going to send, or when. So either I leave the door wide open all the time to anything they want and anything they use, or -- not.)

Pfui




msg:4360417
 9:06 am on Sep 9, 2011 (gmt 0)

FWIW... Hit 20 minutes apart, faking two really, really old UAs:

ec2-46-137-71-213.eu-west-1.compute.amazonaws.com
Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3) Gecko/20041001 Firefox/0.10.1

ec2-75-101-129-73.compute-1.amazonaws.com
Mozilla/5.0 (Windows; U; Windows NT 5.1; en; rv:1.8.1) Gecko/20061010 Firefox/2.0

No robots.txt, of course.

keyplyr




msg:4363096
 11:27 pm on Sep 15, 2011 (gmt 0)

204.236.143.78 percbotspider

rDNS: ec2-204-236-143-78.us-west-1.compute.amazonaws.com
204.236.128.0 to 204.236.255.255
204.236.128.0/17

robots.txt: no

santapaws




msg:4365230
 9:11 am on Sep 21, 2011 (gmt 0)

dstiles thanks for your list. You wouldn't happen to have that list ready to go with cidr ranges by any chance? :)

<added>
ok, i worked it out, i have:</added>
8.18.144.0/23
46.51.128.0/17
46.137.0.0/16
50.16.0.0/14
67.202.0.0/18
72.21.192.0/19
72.44.32.0/19
75.101.128.0/17
79.125.0.0/17
87.238.80.0/21
103.4.8.0/21
107.20.0.0/14
122.248.192.0/18
174.129.0.0/16
175.41.128.0/17
176.32.64.0/18
176.34.128.0/17
184.72.0.0/15
199.255.192.0/22
204.236.128.0/17
207.171.160.0/19
216.182.224.0/20
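
For anyone who would rather not do the range-to-CIDR conversion by hand, here is a minimal sketch using Python's standard ipaddress module. The function name range_to_cidrs is hypothetical, and the sample ranges are taken from dstiles's start-end list earlier in the thread.

```python
import ipaddress
from typing import List


def range_to_cidrs(first: str, last: str) -> List[str]:
    """Convert a start-end IP range into the CIDR blocks that cover it exactly."""
    nets = ipaddress.summarize_address_range(
        ipaddress.ip_address(first), ipaddress.ip_address(last)
    )
    return [str(net) for net in nets]


if __name__ == "__main__":
    ranges = [
        ("8.18.144.0", "8.18.145.255"),
        ("174.129.0.0", "174.129.255.255"),
        ("207.171.160.0", "207.171.191.255"),
    ]
    for first, last in ranges:
        for cidr in range_to_cidrs(first, last):
            print(f"{first} - {last} => {cidr}")
```

A range that does not fall on a clean boundary comes back as several blocks rather than one, so the result is a list.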
