
Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

This 88 message thread spans 3 pages; this is page 2.
Amazon AWS Hosts Bad Bots
amazonaws.com
Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 4368965 posted 12:45 am on Sep 30, 2011 (gmt 0)

1.) Back in 2008, I noticed a lot of bad bots hailing from amazonaws.com and by January, 2009, I started a thread about what hid behind that early cloud:

amazonaws.com plays host to wide variety of bad bots [webmasterworld.com...]

Since that time, 270-plus reports/messages further document that the Amazon AWS Host name and Amazon AWS's countless IPs continue to be what forum mod IncrediBILL aptly termed:

"Cesspool."

This thread continues the saga of amazonaws.com and its spawn.

2.) The AWS cesspool is home to countless hundreds of bots, the vast majority of which ignore robots.txt. Home to hundreds more bots cloaked as regular UAs. Home to infected machines and bad programming, and all the ills to others that cloud anonymity affords.

And in recent weeks, home to bots with no UA at all... [webmasterworld.com...] Note the double-quotes at the end where a UA, or at least a hyphen, should be:

ec2-50-17-87-218.compute-1.amazonaws.com - - [00/Sep/2011:00:00:00] "GET /dir/filename.html HTTP/1.1" 403 1471 "-" ""

Today, the 'blank bot' -- what I've started thinking of as the AWSbot -- was the most frequent AWS 'visitor' to my main site. Four Hosts, four hits to different files, four 403s. robots.txt? NO
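For anyone who wants to pull these hits out of a log automatically, here is a minimal sketch in Python. It assumes the combined-style layout of the sample line above (host, two placeholder fields, bracketed timestamp, quoted request, status, size, quoted referrer, quoted UA). The regex and the function name `has_blank_ua` are illustrative only, not taken from any particular log tool.

```python
import re

# Assumed layout, based on the sample line in this post:
# host ident user [timestamp] "request" status size "referrer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]*)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<ua>[^"]*)"$'
)

def has_blank_ua(line):
    """Return True if the line parses and its UA field is empty or a lone hyphen."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return False  # not in the assumed format; don't guess
    return match.group("ua") in ("", "-")

sample = ('ec2-50-17-87-218.compute-1.amazonaws.com - - [00/Sep/2011:00:00:00] '
          '"GET /dir/filename.html HTTP/1.1" 403 1471 "-" ""')
print(has_blank_ua(sample))  # True
```

Lines that don't fit the assumed layout are reported as False rather than guessed at, so a differently formatted log would need its own pattern.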

 

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 4368965 posted 10:57 pm on Nov 28, 2011 (gmt 0)

Note AWS-related prefix below --

ec2-50-17-110-45.compute-1.amazonaws.com
Stellar/0.1 stellar.io

robots.txt? NO

Stellar.io
= 107.20.176.132
= ec2-107-20-176-132.compute-1.amazonaws.com
= "Stellar.io is a domain controlled by four domain name servers at awsdns-22.org, awsdns-63.net, awsdns-09.com and awsdns-40.co.uk." [robtex.com...]

Noteworthy: awsdns- (Includes employee pet sites/projects?)
Blockworthy? [robtex.com...]
Beats me.

dstiles

WebmasterWorld Senior Member, WebmasterWorld Top Contributor of All Time, 5+ Year Member



 
Msg#: 4368965 posted 10:57 pm on Nov 29, 2011 (gmt 0)

Got hits from shopwiki bot today, looking genuine but from an AWS range.

IP: 107.20.38.235
UA: ShopWiki/1.0 ( +http://www.shopwiki.com/wiki/Help:Bot)

If this is genuine then goodbye shopwiki.

keyplyr

WebmasterWorld Senior Member, WebmasterWorld Top Contributor of All Time, 10+ Year Member, Top Contributors Of The Month



 
Msg#: 4368965 posted 12:36 am on Nov 30, 2011 (gmt 0)

I allowed them for a while since I do sell products, but they were crawling the entire site every day of the week, so I now stop them at robots.txt.

So far Shopwiki has always obeyed the deny in robots.txt.

dstiles - one of the (many) irritating dynamics of AWS is the unaccountability. I'm interested in how you would confirm this is a valid Shopwiki bot if coming from an AWS range?

dstiles

WebmasterWorld Senior Member, WebmasterWorld Top Contributor of All Time, 5+ Year Member



 
Msg#: 4368965 posted 10:40 pm on Nov 30, 2011 (gmt 0)

You're right, but I did say "looking genuine" and "if this is genuine". Without a proper rDNS I cannot be sure it's valid except for the crawl pattern, which looks as extensive as others.

My own findings are that shopwiki does not always give a proper rDNS. Of three ranges I get bots from, only one seems to resolve and that is their acknowledged IP range. Two that do not are in Hurricane and XO ranges; they seem to work as expected and not as forgeries.
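The usual test here is a forward-confirmed reverse DNS lookup: resolve the IP to a host name, confirm the name ends in the domain the bot claims, then resolve that name forward and confirm the original IP is among the results. A rough Python sketch follows. The function name, the injectable `reverse` and `forward` parameters, and the `shopwiki.com` suffix in the usage note are assumptions for illustration; what host names ShopWiki's crawlers actually resolve to is not confirmed here.

```python
import socket

def is_forward_confirmed(ip, allowed_suffixes,
                         reverse=socket.gethostbyaddr,
                         forward=socket.gethostbyname_ex):
    """Forward-confirmed reverse DNS check.

    reverse(ip)   -> (hostname, aliases, addresses), like socket.gethostbyaddr
    forward(name) -> (hostname, aliases, addresses), like socket.gethostbyname_ex
    Both are parameters so the logic can be exercised without live DNS.
    """
    try:
        hostname = reverse(ip)[0].rstrip(".").lower()
    except OSError:
        return False  # no PTR record: nothing can be confirmed
    suffixes = tuple("." + s.lower().lstrip(".") for s in allowed_suffixes)
    if not hostname.endswith(suffixes):
        return False  # rDNS points somewhere other than the claimed owner
    try:
        addresses = forward(hostname)[2]
    except OSError:
        return False
    return ip in addresses  # the name must resolve back to the same IP
```

For the hit above that would be something like `is_forward_confirmed("107.20.38.235", ["shopwiki.com"])`. An address whose rDNS is an ec2-...amazonaws.com name fails the suffix test, which is the point already made: without a proper rDNS the check cannot say the bot is genuine.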

Yes, the bot crawls every day, as do other SEs that have the capacity (G, B etc). The proper way is to change the cache header times (Expires and Cache-Control) from (eg) 24 hours to 240 hours. That should fix the problem. If not, ask the SE why not. (NB: it may still crawl to check the timing but should be happy with the header and not reload the complete page.)

The above is partly conjectural for SEs as my sites all ask for 24 hour refresh periods. Does anyone have further info on this?
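To make the header suggestion concrete, here is a small Python sketch that builds the two header values for a 240-hour lifetime. It only shows what the values look like. The function name `cache_headers` is made up for this example, and whether any given crawler honors these headers remains, as said, partly conjecture.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime

def cache_headers(hours, now=None):
    """Return Cache-Control and Expires header values for a lifetime in hours."""
    now = now or datetime.now(timezone.utc)  # must be timezone-aware UTC
    expires = now + timedelta(hours=hours)
    return {
        "Cache-Control": f"max-age={hours * 3600}",
        # usegmt=True renders the HTTP date form ending in "GMT"
        "Expires": format_datetime(expires, usegmt=True),
    }

for name, value in cache_headers(240).items():
    print(f"{name}: {value}")
```

With 240 hours the Cache-Control value comes out as max-age=864000, ten times the 86400 seconds of a 24-hour setting.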

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 4368965 posted 9:46 pm on Dec 1, 2011 (gmt 0)

FWIW... Baby-botnet bedfellows:

115.94.114.219 <= S. Korea: Threat 22: [projecthoneypot.org...]
12/01 11:09:02

115.85.145.69 <= Taiwan: Threat 21 [projecthoneypot.org...]
12/01 11:09:10

85.248.69.124 <= Slovakia: Threat 27 [projecthoneypot.org...]
12/01 11:09:16

ec2-50-19-13-173.compute-1.amazonaws.com <= Threat 26 [projecthoneypot.org...]
12/01 11:09:35

Faked UA of choice today:

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; FunWebProducts; .NET CLR 1.1.4322; PeoplePal 6.2)

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 4368965 posted 7:37 pm on Dec 3, 2011 (gmt 0)

This just in... Kindle Fire's Silk hit html files, graphics, and favicon from amazonaws.com AND, seconds later, from Comcast: [webmasterworld.com...]

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 4368965 posted 1:34 pm on Dec 11, 2011 (gmt 0)

I reckon this is G spoofery:

ec2-50-17-222-103.compute-1.amazonaws.com
GoogleHttpClient

0n:11:38 (403)
0n:11:38 (403)
0n:11:39 /robots.txt
0n:11:40 (403)

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 4368965 posted 8:39 pm on Dec 19, 2011 (gmt 0)

I don't understand the connections, but thought I'd report in because of the surprise tie-in(s) I found after noting a hit from a .zae.cc domain:

184.73.230.215
= zae.cc
= publisher.smowtion.com
= ec2-184-73-230-215.compute-1.amazonaws.com
= 184.73.0.0/16 Amazon IAD prefix

You'll see that IP and many, many, many more via this extensive (albeit based in Turkey) eye-opener --

Sites published on ISP: Amazon.com, Inc. (598) [livepageranks.com...]
(imagine the data Amazon AWS gets a crack at from hosting all those!)

-- including aggressive Twitter-swarmer bot-runner "paper.li", as in:

ec2-50-19-13-169.compute-1.amazonaws.com
Mozilla/5.0 (compatible; PaperLiBot/2.1; httpt://support.paper.li/entries/20023257-what-is-paper-li)
robots.txt? NO

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 4368965 posted 7:33 pm on Jan 6, 2012 (gmt 0)

Same-second hits from:

ec2-75-101-246-218.compute-1.amazonaws.com
ec2-107-20-68-30.compute-1.amazonaws.com

Quora Link Preview/1.0 (http://www.quora.com)

robots.txt? NO

More details in the just-posted "Quora Link Preview" [webmasterworld.com...]

keyplyr

WebmasterWorld Senior Member, WebmasterWorld Top Contributor of All Time, 10+ Year Member, Top Contributors Of The Month



 
Msg#: 4368965 posted 6:40 pm on Jan 12, 2012 (gmt 0)


UA: SkimBot/1.0 (www.skimlinks.com <dev@skimlinks.com>)
robots.txt: yes

Coming from these AWS ranges:

ec2-176-34-203-24.eu-west-1.compute.amazonaws.com
ec2-50-16-90-19.compute-1.amazonaws.com

dstiles

WebmasterWorld Senior Member, WebmasterWorld Top Contributor of All Time, 5+ Year Member



 
Msg#: 4368965 posted 5:01 pm on Jan 17, 2012 (gmt 0)

New (to me) Amazon range, registered September 2011...

NetRange: 23.20.0.0 - 23.23.255.255
CIDR: 23.20.0.0/14
OriginAS: AS16509
NetName: AMAZON-EC2-USEAST-10
Comment: The activity you have detected originates from a dynamic hosting environment.

Single hit today from bot calling itself linkdex.com/v2.0. No idea about robots.txt without delving deeper. Page was an unusual mid-site one.
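A quick way to double-check a whois NetRange against its CIDR line is Python's standard `ipaddress` module. This sketch uses the range quoted above; the helper name `range_to_cidrs` is just for illustration.

```python
import ipaddress

def range_to_cidrs(first, last):
    """Convert an inclusive first-last IPv4 range into the minimal list of CIDR blocks."""
    start = ipaddress.ip_address(first)
    end = ipaddress.ip_address(last)
    return [str(net) for net in ipaddress.summarize_address_range(start, end)]

print(range_to_cidrs("23.20.0.0", "23.23.255.255"))  # ['23.20.0.0/14']
```

A range that does not fall on a clean boundary comes back as several blocks rather than one.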

keyplyr

WebmasterWorld Senior Member, WebmasterWorld Top Contributor of All Time, 10+ Year Member, Top Contributors Of The Month



 
Msg#: 4368965 posted 7:01 pm on Jan 17, 2012 (gmt 0)


Thanks dstiles, I didn't have that range.

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 4368965 posted 7:17 am on Jan 22, 2012 (gmt 0)

1.) Name change. Note space before closing paren:

SkimBot/1.0 (www.skimlinks.com )

That was from AWS CIDRs:

46.137.0.0/17
204.236.128.0/18

2.) 'Nother bot (184.73.0.0/16 again):

ec2-184-73-47-231.compute-1.amazonaws.com
PArchiveCrawler/1.0 (PArchiveCrawler)

robots.txt? Yes

keyplyr

WebmasterWorld Senior Member, WebmasterWorld Top Contributor of All Time, 10+ Year Member, Top Contributors Of The Month



 
Msg#: 4368965 posted 8:30 am on Jan 22, 2012 (gmt 0)



2.) 'Nother bot (184.73.0.0/16 again):

ec2-184-73-47-231.compute-1.amazonaws.com
PArchiveCrawler/1.0 (PArchiveCrawler)

I use a wider ban:

184.72.0.0 - 184.73.255.255
184.72.0.0/15
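EC2 host names of the form seen throughout this thread embed the IP address (ec2-184-73-47-231 is 184.73.47.231), so a logged host can be checked against a ban range without a DNS lookup. A sketch in Python, assuming that naming pattern holds; the regex and function names are illustrative only.

```python
import ipaddress
import re

# Assumption: host names look like ec2-A-B-C-D.<region part>.amazonaws.com
EC2_HOST = re.compile(r"^ec2-(\d{1,3})-(\d{1,3})-(\d{1,3})-(\d{1,3})\..*amazonaws\.com$")

def ec2_host_to_ip(hostname):
    """Extract the IPv4 address embedded in an EC2-style host name, or None."""
    match = EC2_HOST.match(hostname.lower())
    if match is None:
        return None
    try:
        return ipaddress.ip_address(".".join(match.groups()))
    except ValueError:
        return None  # an octet was out of range, e.g. 999

def host_in_range(hostname, cidr):
    """True if the host name embeds an IP that falls inside the given CIDR block."""
    ip = ec2_host_to_ip(hostname)
    return ip is not None and ip in ipaddress.ip_network(cidr)

host = "ec2-184-73-47-231.compute-1.amazonaws.com"
print(host_in_range(host, "184.73.0.0/16"))  # True
print(host_in_range(host, "184.72.0.0/15"))  # True
```

Both ranges catch this host; the /15 also covers everything in 184.72.x.x.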

dstiles

WebmasterWorld Senior Member, WebmasterWorld Top Contributor of All Time, 5+ Year Member



 
Msg#: 4368965 posted 9:02 pm on Jan 22, 2012 (gmt 0)

Agreed!

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 4368965 posted 9:44 pm on Jan 22, 2012 (gmt 0)

Speaking of wider bans... Recently I spent some time tracking (the) Amazon back to its sources. With help from robtex [robtex.com...] --

## u1.amazonaws.com; [robtex.com...]

deny from 156.154.64.0/24

## u2.amazonaws.com; [robtex.com...]

deny from 156.154.65.0/24

## u3.amazonaws.com / u3.amazonaws.info; [robtex.com...]
## u3.amazonaws.info and u3.amazonaws.com point to 156.154.66.10. Amazonaws.info, amazonaws.org, amazonaws.com, geo.amazonaws.com, compute-1.amazonaws.com and at least six other hosts use 156.154.66.10 as a name server

deny from 156.154.66.0/24

## u4.amazonaws.com; [robtex.com...]

deny from 156.154.67.0/24

## u5.amazonaws.com / u5.amazonaws.org; [robtex.com...]

deny from 156.154.68.0/24

## u6.amazonaws.com / u6.amazonaws.org; [robtex.com...]

deny from 156.154.69.0/24

Just more drops in the literally world-wide AWS bucket.

keyplyr

WebmasterWorld Senior Member, WebmasterWorld Top Contributor of All Time, 10+ Year Member, Top Contributors Of The Month



 
Msg#: 4368965 posted 10:11 pm on Jan 22, 2012 (gmt 0)


156.154.64.0/22
156.154.68.0/23


Interesting Pfui. AWS may be buying up unused D ranges all over.
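The two ranges above are the six /24s from the previous post merged into fewer lines. To do that consolidation mechanically, Python's `ipaddress.collapse_addresses` handles it. This is only a sketch using the ranges already listed in the thread; the helper name `collapse` is an illustration.

```python
import ipaddress

deny_list = [
    "156.154.64.0/24", "156.154.65.0/24", "156.154.66.0/24",
    "156.154.67.0/24", "156.154.68.0/24", "156.154.69.0/24",
]

def collapse(cidrs):
    """Merge adjacent or overlapping networks into the fewest equivalent blocks."""
    networks = [ipaddress.ip_network(c) for c in cidrs]
    return [str(n) for n in ipaddress.collapse_addresses(networks)]

print(collapse(deny_list))  # ['156.154.64.0/22', '156.154.68.0/23']
```

The merged list blocks exactly the same addresses as the six separate deny lines, no more and no fewer.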

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 4368965 posted 9:03 am on Feb 2, 2012 (gmt 0)

And they're infected all over. Literally:

IP Location: Singapore Bedok Amazon Web Services Elastic Compute Cloud Ec2 Sg
122.248.192.0/18

ec2-122-248-241-218.ap-southeast-1.compute.amazonaws.com
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 8.00

Project Honey Pot Threat Rating 26 as of this writing:

= 122.248.241.218 [projecthoneypot.org...]
= .mapmyindia.com [robtex.com...]

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 4368965 posted 5:47 pm on Feb 4, 2012 (gmt 0)

ec2-23-20-17-9.compute-1.amazonaws.com
Mozilla/5.0 (TREC-KBA-Bot http://www.mit.edu/~jrf/knowledge-base-acceleration/bot/)

robots.txt? NO

Two hits in one sec. UA info is a dead-end. One dir up has info but not about bot.

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 4368965 posted 9:16 pm on Feb 21, 2012 (gmt 0)

This same Host came through twice, hours apart, so it may be dedicated. Note the exact UA that appears in logs, backslashed/escaped quotes and all:

ec2-107-20-34-210.compute-1.amazonaws.com

\"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.63 Safari/535.7\"

In a word: Idiotic.

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 4368965 posted 9:24 pm on Feb 21, 2012 (gmt 0)

And now one with no UA and a faked instapaper'esque-service referrer. Hard to say who/what's doing what:

ec2-75-101-136-170.compute-1.amazonaws.com
-

Fake REF: Magazinify - Daemon

dstiles

WebmasterWorld Senior Member, WebmasterWorld Top Contributor of All Time, 5+ Year Member



 
Msg#: 4368965 posted 5:49 pm on Feb 23, 2012 (gmt 0)

Hit from a new (to me) Amazon range today:

176.34.0.0/16
Amazon Ireland (IE)
Various sub-ranges registered to FR and NL.

No indication in DNS records that this is cloud but the hit I received was from 88.208.193.nnn (Fasthosts dedicated servers) using an Amazon IP as a proxy. Amazon range appears to be named servers. No obvious registration dates.

Checking further: I had the range 176.34.128.0/17 blocked as being Amazon but not the first half.
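To see exactly which part of a larger range an existing block misses, `ipaddress` can subtract one network from another. A sketch with the two ranges mentioned above; the helper name `uncovered` is made up for the example.

```python
import ipaddress

def uncovered(whole, already_blocked):
    """Return the parts of `whole` that `already_blocked` does not cover."""
    whole_net = ipaddress.ip_network(whole)
    blocked_net = ipaddress.ip_network(already_blocked)
    # address_exclude raises ValueError if blocked_net is not inside whole_net
    return [str(n) for n in sorted(whole_net.address_exclude(blocked_net))]

print(uncovered("176.34.0.0/16", "176.34.128.0/17"))  # ['176.34.0.0/17']
```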

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 4368965 posted 3:43 am on Mar 3, 2012 (gmt 0)

cloudacl.com

ec2-72-44-41-63.compute-1.amazonaws.com
CloudACL/Nutch-1.4

robots.txt? Yes

dstiles

WebmasterWorld Senior Member, WebmasterWorld Top Contributor of All Time, 5+ Year Member



 
Msg#: 4368965 posted 5:43 pm on Apr 1, 2012 (gmt 0)

And another EC2, registered last August, updated four weeks ago, first hit today with the UA:

UnwindFetchor/1.0 (+http://www.gnip.com/)

184.169.128.0 - 184.169.255.255

Are they ever going to stop?

wilderness

WebmasterWorld Senior Member, WebmasterWorld Top Contributor of All Time, 10+ Year Member, Top Contributors Of The Month



 
Msg#: 4368965 posted 6:24 pm on Apr 1, 2012 (gmt 0)

Are they ever going to stop?


perhaps when hell freezes over.

dstiles

WebmasterWorld Senior Member, WebmasterWorld Top Contributor of All Time, 5+ Year Member



 
Msg#: 4368965 posted 7:48 pm on Apr 1, 2012 (gmt 0)

I don't think they issue weather forecasts for that place. :(

Staffa

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 4368965 posted 9:43 pm on Apr 1, 2012 (gmt 0)

Fresh from the press
Today :

GSLFbot - 174.129.106.148

robots.txt : yes

Couldn't find itself (first visit, what did it expect?)

Went on to the home page where it got a 403, all Amazon ranges are blocked

wilderness

WebmasterWorld Senior Member, WebmasterWorld Top Contributor of All Time, 10+ Year Member, Top Contributors Of The Month



 
Msg#: 4368965 posted 10:06 pm on Apr 1, 2012 (gmt 0)

I didn't "heads up", because there was some recent discussion of the IP Range:

107.20.69.233 - - [01/Apr/2012:14:31:35 +0100] "GET /robots.txt HTTP/1.1" 200 2536 "" "GSLFbot"
107.20.69.233 - - [01/Apr/2012:14:31:35 +0100] "GET / HTTP/1.1" 301 234 "" "GSLFbot"
107.20.69.233 - - [01/Apr/2012:14:31:35 +0100] "GET / HTTP/1.1" 403 533 "" "GSLFbot"

Love2Blog

5+ Year Member



 
Msg#: 4368965 posted 7:42 pm on Apr 4, 2012 (gmt 0)

I've been getting hammered on several sites by this GSLF bot. It has a lot of IPs and they keep changing.

23.20.138.0/24
50.16.28.0/24
184.73.75.0/24
107.20.26.207 (today)

I have been using quick deny on my server to block the IPs individually, but this is taking too much of my time.

How can I block the user agent GSLF? I think it ignores robots.

Is this the best way to block, meaning by blocking the user agent?

Thank you

dstiles

WebmasterWorld Senior Member, WebmasterWorld Top Contributor of All Time, 5+ Year Member



 
Msg#: 4368965 posted 7:59 pm on Apr 4, 2012 (gmt 0)

Block every IP range belonging to Amazon. That's what many of us here are doing and it blocks a LOT of bots. You can find all the ranges we know about in this thread.
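As a rough illustration of checking a hit against a range list, with a user-agent test as a fallback for IPs not yet covered, here is a Python sketch. The ranges are only a handful of those quoted in this thread, not a complete Amazon list. The `gslf` substring match and the function name `should_block` are assumptions for the example; on a real server the same logic would live in the web server or firewall configuration.

```python
import ipaddress

# A few ranges quoted in this thread; NOT a complete list of Amazon ranges.
BLOCKED_RANGES = [ipaddress.ip_network(c) for c in (
    "23.20.0.0/14", "184.72.0.0/15", "176.34.0.0/16", "46.137.0.0/17",
    "204.236.128.0/18", "122.248.192.0/18",
)]
BLOCKED_UA_SUBSTRINGS = ("gslf",)  # case-insensitive match, assumed for illustration

def should_block(ip, user_agent):
    """Block if the IP is in a listed range or the UA contains a listed substring."""
    address = ipaddress.ip_address(ip)
    if any(address in net for net in BLOCKED_RANGES):
        return True
    ua = (user_agent or "").lower()
    return any(token in ua for token in BLOCKED_UA_SUBSTRINGS)

print(should_block("23.20.138.7", "GSLFbot"))        # True (range match)
print(should_block("107.20.26.207", "GSLFbot"))      # True (UA match only)
print(should_block("107.20.26.207", "Mozilla/5.0"))  # False
```

The last two calls show the trade-off: a UA test catches a bot on an unlisted IP only while it keeps announcing itself, whereas a range block catches it whatever UA it sends.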

Love2Blog

5+ Year Member



 
Msg#: 4368965 posted 8:28 pm on Apr 4, 2012 (gmt 0)

Block every IP range belonging to Amazon


Are you guys just blocking using IP Deny in shared hosting or quick deny in the Firewall on a VPS server?

About the ranges, is /24 the one that includes the entire range?

I see some people using /18, /15..etc?

Thanks
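On the /24 versus /18 versus /15 question: the number after the slash is how many leading bits are fixed, so a smaller number means a bigger block. A /24 is 256 addresses, and it only "includes the entire range" if the range happens to be that size. A short Python sketch using three ranges from this thread; the helper name `describe` is made up for the example.

```python
import ipaddress

def describe(cidr):
    """Summarize a CIDR block as first address, last address, and size."""
    net = ipaddress.ip_network(cidr)
    # net[0] and net[-1] are the first and last addresses in the block
    return f"{cidr}: {net[0]} - {net[-1]} ({net.num_addresses} addresses)"

for cidr in ("23.20.138.0/24", "204.236.128.0/18", "184.72.0.0/15"):
    print(describe(cidr))
# 23.20.138.0/24: 23.20.138.0 - 23.20.138.255 (256 addresses)
# 204.236.128.0/18: 204.236.128.0 - 204.236.191.255 (16384 addresses)
# 184.72.0.0/15: 184.72.0.0 - 184.73.255.255 (131072 addresses)
```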

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved