Home / Forums Index / Search Engines / Search Engine Spider and User Agent Identification
Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

This 88 message thread spans 3 pages.
Amazon AWS Hosts Bad Bots
amazonaws.com
Pfui




msg:4368967
 12:45 am on Sep 30, 2011 (gmt 0)

1.) Back in 2008, I noticed a lot of bad bots hailing from amazonaws.com and by January, 2009, I started a thread about what hid behind that early cloud:

amazonaws.com plays host to wide variety of bad bots [webmasterworld.com...]

Since that time, 270-plus reports/messages further document that the Amazon AWS Host name and Amazon AWS's countless IPs continue to be what forum mod IncrediBILL aptly termed:

"Cesspool."

This thread continues the saga of amazonaws.com and its spawn.

2.) The AWS cesspool is home to countless hundreds of bots, the vast majority of which ignore robots.txt. Home to hundreds more bots cloaked as regular UAs. Home to infected machines and bad programming, and all the ills to others that cloud anonymity affords.

And in recent weeks, home to bots with no UA at all... [webmasterworld.com...] Note the double-quotes at the end where a UA, or at least a hyphen, should be:

ec2-50-17-87-218.compute-1.amazonaws.com - - [00/Sep/2011:00:00:00] "GET /dir/filename.html HTTP/1.1" 403 1471 "-" ""

Today, the 'blank bot' -- what I've started thinking of as the AWSbot -- was the most frequent AWS 'visitor' to my main site. Four Hosts, four hits to different files, four 403s. robots.txt? NO
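For anyone wanting to pull these 'blank bot' hits out of a log automatically, here is a minimal sketch. It assumes the Apache "combined" log format with hostname lookups enabled, as in the sample line above; the names `LOG_RE` and `is_blank_ua_from_aws` are made up for illustration and are not from any tool mentioned in this thread.

```python
import re

# Assumed layout: Apache "combined" format with hostname lookups on,
# matching the sample line quoted in the post above.
LOG_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]*)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<ua>[^"]*)"$'
)

def is_blank_ua_from_aws(line):
    """Return True for an amazonaws.com host that sent an empty User-Agent."""
    m = LOG_RE.match(line.strip())
    if m is None:
        return False  # line is not in the assumed format
    return m.group("host").endswith(".amazonaws.com") and m.group("ua") == ""

sample = ('ec2-50-17-87-218.compute-1.amazonaws.com - - [00/Sep/2011:00:00:00] '
          '"GET /dir/filename.html HTTP/1.1" 403 1471 "-" ""')
print(is_blank_ua_from_aws(sample))  # True
```

A hyphen in the UA field ("-") is deliberately not treated as blank here, since the point of the post is the empty pair of double-quotes.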

 

lucy24




msg:4368974
 1:42 am on Sep 30, 2011 (gmt 0)

the 'blank bot' -- what I've started thinking of as the AWSbot

Gee, that's funny. I always think of it as the faviconbot ;)

How 'bout the new browser [webmasterworld.com]?

We sought from the start to tap into the power and capabilities of the AWS infrastructure

Now there's a sales pitch to make your blood run cold. And, as noted in that thread, it means messing about with your Allows and Denys so you don't end up locking out unsuspecting humans.

Pfui




msg:4368979
 1:52 am on Sep 30, 2011 (gmt 0)

I didn't want to mix up AWS bad bot sightings/reports in this thread with discussions of AWS (ww)world domination, Amazon's new Silk and Fire, etc. Check out the just-posted:

Amazon AWS gunning for Google? [webmasterworld.com...]

Pfui




msg:4372153
 3:10 am on Oct 8, 2011 (gmt 0)

Two hits to html files, ~15 secs apart.

ec2-50-19-197-197.compute-1.amazonaws.com
HTTP_Request2/2.0.0RC1 (http://pear.php.net/package/http_request2) PHP/5.3.2-1ubuntu4.9

robots.txt? NO

Pfui




msg:4372157
 3:26 am on Oct 8, 2011 (gmt 0)

Not all of AWS's UAs are obvious bots:

ec2-184-72-188-54.compute-1.amazonaws.com
Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3 ( .NET CLR 3.5.30729; .NET4.0E)

robots.txt? NO

Pfui




msg:4372403
 4:00 pm on Oct 9, 2011 (gmt 0)

ec2-184-73-116-52.compute-1.amazonaws.com
Mozilla

robots.txt? NO

ec2-50-19-197-197.compute-1.amazonaws.com
HTTP_Request2/2.0.0RC1 (http://pear.php.net/package/http_request2) PHP/5.3.2-1ubuntu4.9

robots.txt? NO

Pfui




msg:4375261
 12:24 am on Oct 17, 2011 (gmt 0)

Today's worst AWS assault:

10 amazonaws.com servers
=> 26 unique, non-contiguous .html files, 1 .cgi file, 0 robots.txt
=> 27 403s in 9 secs

FWIW, sorted by server (per log program) thus times overlap. All ostensibly using:

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.112 Safari/535.1

ec2-50-18-13-33.us-west-1.compute.amazonaws.com
07:07:16 /dir2/file.html

ec2-50-18-27-118.us-west-1.compute.amazonaws.com
07:07:16 /dir/file27.html
07:07:16 /dir/file47.html
07:07:16 /dir1/file11.html

ec2-184-72-19-151.us-west-1.compute.amazonaws.com
07:07:14 /dir/file30.html
07:07:15 /dir/file42.html
07:07:16 /dir/file52.html

ec2-50-18-140-3.us-west-1.compute.amazonaws.com
07:07:14 /dir/file25.html
07:07:15 /dir/file38.html
07:07:16 /dir/file45.html

ec2-204-236-189-32.us-west-1.compute.amazonaws.com
07:07:13 /dir/file29.html
07:07:13 /dir/file13.html
07:07:14 /dir/file07.html
07:07:15 /dir/file41.html
07:07:15 /dir/file40.html
07:07:16 /dir/file48.html

ec2-50-18-85-139.us-west-1.compute.amazonaws.com
07:07:12 /dir2/dir/file.cgi
07:07:16 /dir/file51.html
07:07:16 /dir3/file.html

ec2-50-18-30-123.us-west-1.compute.amazonaws.com
07:07:09 /dir/file14.html
07:07:14 /dir/file19.html
07:07:16 /dir/file49.html

ec2-204-236-175-96.us-west-1.compute.amazonaws.com
07:07:08 /dir4/file.html

ec2-204-236-181-50.us-west-1.compute.amazonaws.com
07:07:07 /dir/file08.html

ec2-184-72-10-186.us-west-1.compute.amazonaws.com
07:07:07 /
07:07:15 /dir/file35.html
07:07:15 /dir/file32.html

##
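The summary figures at the top of this post (10 servers, 27 hits, 9 secs) can be re-derived from the listing. The sketch below is only an illustration: the host labels are shortened from the hostnames above, the times are copied as listed, and `summarize` is a hypothetical helper name.

```python
from datetime import datetime

# Hits copied from the listing above: short host label -> request times.
hits = {
    "50-18-13-33": ["07:07:16"],
    "50-18-27-118": ["07:07:16", "07:07:16", "07:07:16"],
    "184-72-19-151": ["07:07:14", "07:07:15", "07:07:16"],
    "50-18-140-3": ["07:07:14", "07:07:15", "07:07:16"],
    "204-236-189-32": ["07:07:13", "07:07:13", "07:07:14",
                       "07:07:15", "07:07:15", "07:07:16"],
    "50-18-85-139": ["07:07:12", "07:07:16", "07:07:16"],
    "50-18-30-123": ["07:07:09", "07:07:14", "07:07:16"],
    "204-236-175-96": ["07:07:08"],
    "204-236-181-50": ["07:07:07"],
    "184-72-10-186": ["07:07:07", "07:07:15", "07:07:15"],
}

def summarize(hits_by_host):
    """Return (server count, total hits, seconds from first to last hit)."""
    times = [datetime.strptime(t, "%H:%M:%S")
             for ts in hits_by_host.values() for t in ts]
    span = int((max(times) - min(times)).total_seconds())
    return len(hits_by_host), len(times), span

print(summarize(hits))  # (10, 27, 9)
```

The same tally over a live log, grouped by the amazonaws.com suffix, is one way to spot this kind of distributed burst even when no single host makes more than a handful of requests.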

Pfui




msg:4375290
 2:32 am on Oct 17, 2011 (gmt 0)

Must be my lucky day for hits from 50.18. --

ec2-50-18-23-16.us-west-1.compute.amazonaws.com
Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; MALC)

16:42:59 /
16:43:00 /index.php
16:43:00 /index.php
16:43:01 /index.html
16:43:02 /index.html

Pure probe. There are no files by those names in that directory.

Pfui




msg:4376548
 2:12 pm on Oct 19, 2011 (gmt 0)

Two seconds apart to the same rarely directly-hit file. Coincidence?

ec2-204-236-161-233.us-west-1.compute.amazonaws.com
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2 (.NET CLR 3.5.30729; Diffbot/0.1; +http://www.diffbot.com)

02:38:30 /dir/filename.html
robots.txt? NO

ec2-50-16-74-139.compute-1.amazonaws.com
Mozilla/5.0 (compatible; Topicmarks/1.0)

02:38:32 /dir/filename.html
robots.txt? NO

Diffbot (old-timer): [google.com...]
Topicmarks (just posted): [webmasterworld.com...]

Staffa




msg:4377589
 10:39 am on Oct 21, 2011 (gmt 0)

I had a visit from a log spammer coming from a new (to me) aws range : 107.20.0.0 - 107.23.255.255

Though it's not a crawler per se I thought I'd mention the range.

Pfui




msg:4377593
 10:53 am on Oct 21, 2011 (gmt 0)

What were its IP and UA, please? TIA

Staffa




msg:4377624
 1:10 pm on Oct 21, 2011 (gmt 0)

IP : 107.22.51.16
UA : Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Trident/4.0)

and, ironically, log spamming for a web site for webmasters

Pfui




msg:4377898
 1:09 am on Oct 22, 2011 (gmt 0)

Thanks for the details, Staffa.
---
This next sighting makes sense seeing as how Amazon owns Alexa:

ec2-174-129-237-157.compute-1.amazonaws.com
ia_archiver (+http://www.alexa.com/site/help/webmasters; crawler@alexa.com)

robots.txt? Yes

keyplyr




msg:4377923
 2:35 am on Oct 22, 2011 (gmt 0)

I've now completed the final step to leading a 100% Amazon-free lifestyle. A very liberating feeling ;)

Since the events a few months ago when Amazon abandoned their California sales affiliates (causing me a long week's work to re-architect 3 good-sized web sites), added to the never-ending AWS nuisance, added to bogus Alexa ranking practices, added to the announcement that the Amazon marketplace would no longer give A-Z guarantees beyond 5 events, added to the rate increase w/ Amazon CC, added to their unwillingness to credit my card when one of their vendors reneged on a sale, ad infinitum...

All AWS IP ranges blocked, all Amazon IP ranges blocked, all Alexa IP ranges blocked, accounts of any Amazon affiliates doing business with us closed, all Amazon customer accounts closed/deleted, all contact info, browser favorites and any other connection to Amazon now deleted.

[edited by: keyplyr at 2:46 am (utc) on Oct 22, 2011]

dstiles




msg:4378128
 9:44 pm on Oct 22, 2011 (gmt 0)

Well done! Next week, google. :)

Pfui




msg:4379293
 7:45 pm on Oct 25, 2011 (gmt 0)

ec2-184-72-115-86.compute-1.amazonaws.com
DuckDuckPreview/1.0; (+http://duckduckgo.com/duckduckpreview.html)

robots.txt? NO

Previously, about DuckDuckBot: [webmasterworld.com...]

The UA's URL says they "grab pages on behalf of our users and display to them parts of those pages most relevant to their queries." Not. DuckDuckGo's hair-splitting 'not crawler, not spider' claims to the contrary, that AWS bot hit was not a real-time query "user."

dstiles




msg:4379347
 9:46 pm on Oct 25, 2011 (gmt 0)

Always had that one down as a goodie (although not preview, which is a new one on me). Had an email exchange with the owner a while ago, as well, which seemed to go well.

If they've moved operations to AWS they won't find me again, though.

Pfui




msg:4379402
 11:46 pm on Oct 25, 2011 (gmt 0)

Got twitter-swarmed a bit ago. In addition to a boatload of AWS bots, two particularly bad ones:

ec2-184-73-108-194.compute-1.amazonaws.com
MetaURI API/2.0 +metauri.com
robots.txt? NO
ERROR: Client sent malformed Host header <-- x2

ec2-50-18-24-18.us-west-1.compute.amazonaws.com
percbotspider
robots.txt? NO
ERROR: Client sent malformed Host header <-- x2

keyplyr




msg:4379423
 12:45 am on Oct 26, 2011 (gmt 0)

DuckDuck got some mention here when they first launched; seemed like a clever start-up. Too bad they're now coming from AWS.

dstiles




msg:4379772
 9:28 pm on Oct 26, 2011 (gmt 0)

Got several hits today on one site with the UA:

Test Spider 0.2

Imaginative! Hit with requests for a few long-standing pages, some long-missing pages and some never-there sitemap files. Blocked, of course.

Pfui




msg:4381322
 10:59 am on Oct 30, 2011 (gmt 0)

New Twitter-swarmer:

ec2-50-18-170-80.us-west-1.compute.amazonaws.com
NewsTrust

robots.txt? NO

And MetaURI is getting worse. Out of five hits, it blew THREE errors this time:

ec2-50-17-88-207.compute-1.amazonaws.com
MetaURI API/2.0 +metauri.com

[21:30:23 2011] [error] [client 50.17.88.207] Client sent malformed Host header
[21:30:23 2011] [error] [client 50.17.88.207] Client sent malformed Host header
[21:30:23 2011] [error] [client 50.17.88.207] Client sent malformed Host header
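Note how the client IP in the error log (50.17.88.207) is embedded in the rDNS name (ec2-50-17-88-207...). Assuming that naming pattern holds for the EC2 hostnames reported in this thread, a small sketch like the following can recover the IP from a hostname when matching access-log entries against error-log entries. `EC2_HOST_RE` and `ec2_host_to_ip` are hypothetical names, and the pattern is inferred from the examples here, not from any AWS specification.

```python
import ipaddress
import re

# Pattern inferred from hostnames in this thread, e.g.
#   ec2-50-17-88-207.compute-1.amazonaws.com
#   ec2-50-18-170-80.us-west-1.compute.amazonaws.com
EC2_HOST_RE = re.compile(
    r'^ec2-(\d{1,3})-(\d{1,3})-(\d{1,3})-(\d{1,3})\.(?:[a-z0-9-]+\.)*amazonaws\.com$'
)

def ec2_host_to_ip(hostname):
    """Return the IPv4Address embedded in an EC2-style hostname, or None."""
    m = EC2_HOST_RE.match(hostname.lower())
    if m is None:
        return None
    try:
        return ipaddress.IPv4Address(".".join(m.groups()))
    except ipaddress.AddressValueError:
        return None  # e.g. an octet above 255

print(ec2_host_to_ip("ec2-50-17-88-207.compute-1.amazonaws.com"))  # 50.17.88.207
```

This only parses the name; it does not perform any DNS lookup, so it says nothing about whether the hostname itself is genuine.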

Pfui




msg:4386300
 8:28 pm on Nov 12, 2011 (gmt 0)

ec2-174-129-37-252.compute-1.amazonaws.com
wf_crawler (http://www.websitefigures.com)

robots.txt? NO

More details in the just-posted "wf_crawler" [webmasterworld.com...]

Pfui




msg:4386304
 8:33 pm on Nov 12, 2011 (gmt 0)

Yet another AWS somethingorother:

ec2-175-41-250-151.ap-northeast-1.compute.amazonaws.com
ceron.jp/1.0

robots.txt? NO

[robtex.com...]

keyplyr




msg:4388848
 12:16 pm on Nov 19, 2011 (gmt 0)

rDNS: ec2-50-112-27-181.us-west-2.compute.amazonaws.com
UA: Mozilla/5.0 (compatible; Bender; http://benderthewebrobot.tumblr.com)
robots.txt: no

Image scraper. I didn't have this range blocked. Maybe new?

50.112.0.0 - 50.112.255.255
50.112.0.0/16
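For converting a first-last range into CIDR notation for a block list, Python's standard `ipaddress` module can do the arithmetic. A minimal sketch, with `range_to_cidrs` as a made-up helper name; the second call uses the 107.20.0.0 - 107.23.255.255 range Staffa reported earlier in this thread.

```python
import ipaddress

def range_to_cidrs(first, last):
    """Return the CIDR blocks that exactly cover first..last (inclusive)."""
    nets = ipaddress.summarize_address_range(
        ipaddress.IPv4Address(first), ipaddress.IPv4Address(last))
    return [str(n) for n in nets]

print(range_to_cidrs("50.112.0.0", "50.112.255.255"))  # ['50.112.0.0/16']
print(range_to_cidrs("107.20.0.0", "107.23.255.255"))  # ['107.20.0.0/14']
```

A range that does not fall on a clean boundary comes back as several blocks rather than one.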

Staffa




msg:4388876
 3:09 pm on Nov 19, 2011 (gmt 0)

Thanks for the heads up

175.41 and 50.112 are ranges new to me

Pfui




msg:4388929
 8:16 pm on Nov 19, 2011 (gmt 0)

FWIW: [webmasterworld.com...]

dstiles




msg:4388937
 9:01 pm on Nov 19, 2011 (gmt 0)

The 50.112/16 was new to me, too. Thanks.

Pfui




msg:4389261
 3:04 am on Nov 21, 2011 (gmt 0)

Yet another you-know-whatter:

ec2-174-129-32-219.compute-1.amazonaws.com
TweetReports.com

robots.txt? NO

Pfui




msg:4390979
 1:57 pm on Nov 25, 2011 (gmt 0)

Same bot, same behavior, two minutes apart:

ec2-184-72-68-95.compute-1.amazonaws.com
ec2-50-17-154-105.compute-1.amazonaws.com
SemrushBot/0.9

robots.txt? Yes

Previously about SemrushBot... [webmasterworld.com...]

keyplyr




msg:4391128
 9:37 pm on Nov 25, 2011 (gmt 0)

A month ago AWS announced that they've added a new US West (Oregon) Region. Possible new IP ranges?

[aws.amazon.com...]

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved