| 11:54 pm on Feb 18, 2010 (gmt 0)|
robots.txt? Yes - ignored it.
Went after Homepage and left with a fat 403.
| 11:54 pm on Feb 25, 2010 (gmt 0)|
From the ddanchev security blog re: spamvertised casino games serving up viruses:
"What's particularly interesting about the campaign, is the fact that all of the domains serve identical template, with the (virusname) binary hosted 'in the cloud' thanks to Amazon's Web Services."
| 5:40 am on Feb 28, 2010 (gmt 0)|
Mozilla/5.0 [Internet Explorer]
| 4:33 am on Mar 2, 2010 (gmt 0)|
Made a beeline for two dynamically generated files. HEAD requests.
| 5:10 am on Mar 14, 2010 (gmt 0)|
Mozilla/5.0 (compatible; Feedtrace-bot/0.2; firstname.lastname@example.org)
03/13 17:58:05 /robots.txt
Mozilla/5.0 (compatible; Feedtrace-bot/0.2; email@example.com)
03/13 17:58:06 /robots.txt
| 3:18 am on Mar 20, 2010 (gmt 0)|
| 2:36 pm on Mar 22, 2010 (gmt 0)|
JS-Kit URL Resolver, http://js-kit.com/
| 7:45 pm on Apr 8, 2010 (gmt 0)|
Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:18.104.22.168) Gecko/2008102920 http://ow.ly web crawler (.NET CLR 3.5.30729)
| 5:45 pm on Apr 10, 2010 (gmt 0)|
TrueKnowledgeBot (http://www.trueknowledge.com/tkbot/; tkbot -AT- trueknowledge _dot_ com)
| 4:39 am on Apr 20, 2010 (gmt 0)|
Fake ref/Log spam? YES, the UA's Host
| 6:54 am on Apr 20, 2010 (gmt 0)|
Twitter-swarming 'web miner':
Mozilla/5.0 (compatible; ptd-crawler; +http://bixolabs.com/crawler/ptd/; firstname.lastname@example.org)
| 11:01 pm on Apr 24, 2010 (gmt 0)|
Mozilla/5.0 (compatible; Firefox Addon; Windows XP 5.1)
Four hits to / in five seconds (403s ignored).
| 9:33 pm on Apr 30, 2010 (gmt 0)|
Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:22.214.171.124) Gecko/20100401 Firefox/3.6.3
| 5:31 pm on May 11, 2010 (gmt 0)|
| 7:21 pm on May 11, 2010 (gmt 0)|
Again from AWS, last seen as "Acquia Crawler" in Aug., 2009 here [webmasterworld.com]
| 3:08 pm on May 16, 2010 (gmt 0)|
Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Win64; x64; Trident/4.0)
| 10:51 pm on May 31, 2010 (gmt 0)|
"Again from AWS, last seen as "Acquia Crawler" in Aug., 2009 here [webmasterworld.com]"
I'm the developer of the current acquia crawler. It actually has no shared code with the "old" one.
I'm sorry that it didn't check the robots.txt first. It SHOULD check the robots.txt first (and in general only look at the front page ("/") of the website).
It would be nice if you could PM me the URL of your site so I can see what might have caused this.
| 5:17 pm on Jun 21, 2010 (gmt 0)|
Real Yahoo doubtful, double-cloak, or double-cross? You be the judge...
Mozilla/5.0 [en] (X11; U; Linux 2.2.15 i686 +http://www.yahoo.com/index.html)
robots.txt? Yes, but promptly ignored.
06/21 02:30:13 /robots.txt
06/21 02:30:17 /
| 9:31 am on Jun 22, 2010 (gmt 0)|
Pfui... you are our resident aws expert. Have you considered updating your original post with new info? 168 messages to wade through... well, yeah, I suppose that's why we're here... the interaction. Yet, the issue is getting out of hand! As Boris Karloff used to say: BWAHAHAHAHA!
| 3:32 am on Jul 10, 2010 (gmt 0)|
tangor, long story short:
Block/deny/rewrite .amazonaws.com by name
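For anyone who wants to see what "by name" means in practice, here is a minimal Python sketch, not anyone's production setup. It assumes you can reverse-resolve the visitor's IP address and that AWS hosts resolve to names ending in .amazonaws.com. The function names and the sample hostname are made up for illustration, and the actual deny or rewrite rule belongs in your own server config.

```python
import socket

# Suffix to match; everything here is illustrative, not a complete blocklist.
AWS_SUFFIX = ".amazonaws.com"


def is_aws_hostname(hostname: str) -> bool:
    """Return True if hostname is amazonaws.com or any subdomain of it."""
    # Normalize case and drop a trailing root dot before comparing.
    name = hostname.strip().rstrip(".").lower()
    return name == AWS_SUFFIX[1:] or name.endswith(AWS_SUFFIX)


def resolves_to_aws(ip: str) -> bool:
    """Reverse-resolve an IP address and test the resulting name.

    Returns False when the address has no reverse DNS entry.
    """
    try:
        hostname, _aliases, _addresses = socket.gethostbyaddr(ip)
    except OSError:
        return False
    return is_aws_hostname(hostname)


if __name__ == "__main__":
    # Hypothetical hostname in the style of a cloud-hosted visitor.
    print(is_aws_hostname("ec2-192-0-2-10.compute-1.amazonaws.com"))
```

Matching on the leading dot keeps look-alike names such as notamazonaws.com from being caught by accident.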
And now, from the Department of Tweet ReReReRedundancy Department:
Fake Ref: @hourlypress
(Tweets upon retweets upon reretweets remind me more and more of e-mailed chain letter plagues.)
| 8:49 am on Jul 10, 2010 (gmt 0)|
I must be missing something because I still don't understand why amazonaws.com is visiting so many sites and so often. Can someone please explain it?
Amazonaws.com is in almost all of my sites referral logs with lots of visits day after day, week after week, and month after month.
Why in the world does Amazon want to visit my sites in the first place? For what purpose? How do they know about my URLs, especially the new sites? Is it somehow connected to the amazon.com affiliate program (even though many of the sites they visit are not using the affiliate program and never did)?
Sometimes they visit a brand new site I just put online even before anyone else does or the site gets indexed anywhere.
How would Amazon know that my url was just put online? Sometimes they are my #1 traffic source! How and why is this happening? Please excuse my lack of knowledge on this subject.
| 6:21 pm on Jul 10, 2010 (gmt 0)|
This thread is speaking of amazonaws.com, a virtual computing service offered by Amazon. The actual owners of these agents that come to your web sites are customers of Amazon.
I suggest reading this entire thread as well as doing some searches to gain a better understanding of what happens at a cloud service. There are others besides amazonaws.com.
| 7:45 pm on Jul 10, 2010 (gmt 0)|
Thanks. I did read or scan most of the thread (though it is often too technical for me), but I still do not understand why they are visiting my sites so much or what they have to gain from all the hits. Why is it worth the effort?
I am also confused about how they even know about my websites in the first place (an issue I do not see answered), especially my brand-new site URLs, just developed and not indexed anywhere yet. Sometimes I see amazonaws.com in my log files as one of the first visitors to a new site put online just the day before!
Sorry, I am not too educated on this, or on cloud services, which I had never heard of before you mentioned them. P.S. I am not a programmer or a geek (obviously).
| 9:04 pm on Jul 10, 2010 (gmt 0)|
In a nutshell: "amazonaws.com" is an alias used by a multitude of anonymous AWS customers.
Amazon, like others, offers 'cloud' computing services where companies/people rent virtual server space. This thread is testament to the stunning number of bad robots/crawlers/crap apps hiding in the amazonaws.com cloud.
And those "amazonaws.com" hits you're seeing? Someone's using your resources for secret reasons.
| 10:07 pm on Jul 10, 2010 (gmt 0)|
To add to the list of bots from aws:
NetcraftSurvey, previously running from UK BT block 188.8.131.52/24, is now on AWS with a new bot UA. At least, I think it's the same service: the new one includes netcraft.com in the UA and the old one had a referer of netcraft.com, so they may serve different functions.
New UA: Mozilla/5.0 (compatible; NetcraftSurveyAgent/1.0; +email@example.com)
It's possible the old one is still running. First time I saw this was today.
| 3:29 pm on Jul 11, 2010 (gmt 0)|
Thanks Pfui, but what in the world could some of the "secret reasons" possibly be? I am guessing that sending spam is one of them?
We are aware of lots of spam being sent with disguised headers, claiming to be from me and from an email address configured on the server (a far-from-obvious, unique name that could not be guessed).
However, that email address is not shown or listed anywhere on the web. It was only created when the site was set up on the server, because the setup process wanted an email address to be entered (though we never used it or intended to use it).
| 4:39 pm on Jul 11, 2010 (gmt 0)|
Probably not quite so nefarious. I reckon secret reasons are more likely trade secret-related -- companies/people data-mining to build or improve their hoped-for better mousetraps.
I get why they hide. But when they use/abuse my stuff, I want WHOIS, if not Who, What, When, Where, Why. Because, heck, if their aims aren't nefarious, why not?
(A bit OT: Spam-wise, faked addresses can be from anybody anywhere. But with yours that private to begin with... I dunno. Perhaps your server was more vulnerable at some point than it is now?)
| 8:49 pm on Jul 11, 2010 (gmt 0)|
My experience of AWS is that it hosts a myriad of site scrapers and trivial bots (e.g. dozens of Twitter/Facebook me-too bots) that have no obvious use and no definite IP to block if we don't like them. My logs are full of the things, all blocked. Many of them pretend to be browsers, although certain characteristics belie that.
If a bot can't identify itself properly and supply a fixed address it's toast! I don't care who or what it is.
| 1:04 am on Jul 13, 2010 (gmt 0)|
A new addition to the post-Tweet swarm:
| 1:36 am on Jul 13, 2010 (gmt 0)|
What a waste of log space. All post-Tweet HEAD hits to the exact same file. (403s obviously ignored.) No robots.txt 'natch -- or ever, re most UAs from amazonaws.
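If you want to pick this pattern out of your own logs, here is a rough Python sketch. It assumes the log lines have already been parsed into (host, method, path) tuples. The function name and the sample hostnames are hypothetical, and it only flags hosts that sent nothing but HEAD requests and never asked for /robots.txt.

```python
from collections import defaultdict


def head_only_no_robots(records):
    """Return a sorted list of hosts that made only HEAD requests
    and never requested /robots.txt.

    records: iterable of (host, method, path) tuples, already parsed
    from whatever log format your server writes.
    """
    methods_by_host = defaultdict(set)
    fetched_robots = set()
    for host, method, path in records:
        methods_by_host[host].add(method.upper())
        if path == "/robots.txt":
            fetched_robots.add(host)
    return sorted(
        host
        for host, methods in methods_by_host.items()
        if methods == {"HEAD"} and host not in fetched_robots
    )


if __name__ == "__main__":
    # Made-up sample records for illustration only.
    sample = [
        ("bot-a.example.net", "HEAD", "/page.html"),
        ("bot-a.example.net", "HEAD", "/page.html"),
        ("crawler.example.org", "GET", "/robots.txt"),
        ("crawler.example.org", "GET", "/"),
    ]
    print(head_only_no_robots(sample))
```

A host that shows up here is not necessarily hostile, so treat the list as a starting point for a closer look rather than an automatic block list.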
| 5:20 pm on Jul 16, 2010 (gmt 0)|
I don't think most are hiding.
AWS cloud services just happen to be cheap and fast, something a startup company with limited resources would find appealing.