Forum Moderators: open
And then it started crawling as this user agent:
Nutch/0.9+(OpenX+Spider)
Guess who the host was:
OrgName: Amazon.com, Inc.
OrgID: AMAZO-4
Address: Amazon Web Services, Elastic Compute Cloud, EC2
Address: 1200 12th Avenue South
City: Seattle
StateProv: WA
I don't know who may be renting server time and bandwidth from Amazon, and I don't care -- No host rDNS, no admittance.
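Jim's "no host rDNS, no admittance" rule can be sketched as a forward-confirmed reverse DNS check: look up the PTR record for the client IP, then confirm the returned hostname resolves back to the same IP. This is a minimal illustration, not anyone's actual server configuration; the resolver parameters exist only so the logic can be exercised without live DNS.

```python
import socket

def reverse_lookup(ip):
    """Return the PTR hostname for ip, or None if no reverse record exists."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except socket.herror:
        return None

def admit(ip, resolve_ptr=reverse_lookup, resolve_a=socket.gethostbyname):
    """Admit a client only if its rDNS forward-confirms to the same IP."""
    hostname = resolve_ptr(ip)
    if hostname is None:
        # No PTR record at all -- the situation described for EC2 addresses.
        return False
    try:
        return resolve_a(hostname) == ip
    except socket.gaierror:
        return False
```

A host with no PTR record, or whose claimed hostname resolves elsewhere, is simply turned away.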
Jim
We launched our OpenX Hosted product a few weeks ago, which offers a free ad server for publishers up to 25 million impressions a month. We used to just provide open source software and paid consulting, but due to the demand, a hosted service was built.
I recognize that people have varying opinions about online advertising. Hopefully you can understand that we have many publishers in our community that depend on advertising in order to make their livelihood, and even more so in the current economic situation.
Because the service is free up to 25M impressions/month, along with the legitimate account requests we have, as expected, also received many bogus ones. Our hosted ad server's terms of use include documented guidelines intended to protect user privacy on the internet and to guard against illegal activities, abuse, and the other scams I'm sure you've all seen and heard before.
As a result, we've needed to crawl all submissions as well as those sites' inlinks and outlinks to a certain depth in order to validate the content and legitimacy of the sites, do some scanning of the documents offline, and make sure we're not allowing anything we shouldn't.
Sorry to disappoint, but there wasn't a board meeting to think of creative ways to fool or trick you, or to pull off some new scam. We switched the UA from Nutch to Mozilla to detect some cloakers that were found trying to get hosted accounts, but the description and website strings are still there so you can get in touch. We still use a modified version of Nutch; it was chosen because some of us used to work at Yahoo, because it is open source, and because it can be automated for mass-market products like this. Amazon is used because of our financial resource constraints, not to hide anything, and I agree that they should give legitimate users a way to set up PTR records.
Of course, it's your website and you can ban the bot as you please, but hopefully the description tells you what we do with it. It will conservatively recrawl in order to catch cloaking sites that sign up under one purpose and switch later, but it should not be abusive or try to steal information. If it does something suspicious, or if you have suggestions, please feel free to report it to us.
I'm sorry to sound so harsh, but I've logged two failed attempts because I block both nutch and anything coming from the roach motel that is amazonaws.com.
mlum,
Welcome, and many thanks for taking the time to register and offer an explanation.
My comment regarding the board meeting was tongue-in-cheek humor, which you failed to perceive.
Nor was my comment any reflection on your company's plans or methods.
Many, many bots use names in their UAs which are simply inadequate (i.e., spider, crawl, and others), judged by their effect rather than by the names a collective group of webmasters have ascertained to be "bad boys" in their dictionaries.
My attempt at humor was that your board selected four (three visible) "bad boys" (in their defense it may have been a marketing choice).
The humor also suggested being more creative with the UA name.
Don
My problem is with Amazon Compute Cloud, and not with you or your company or its activities. Due to the nature of providing temporary and/or enhanced compute and/or connectivity resources, Amazon does not bother to provide reverse-DNS lookups on the IP addresses used by companies such as yours which use their Compute Cloud services. This makes it impossible for Webmasters to determine whether requests from the Cloud are legitimate, or are from one of the many intellectual property thieves (scrapers), e-mail address harvesters, or other abusive denizens of the Web.
That is, if I see a session in my logs where a Web client is downloading pages from my site from a Compute Cloud IP address range, it is impossible for me to look up the WHOIS info and make any judgement about whether the requestor is legitimate. Some are, but many are not.
This is a problem that you as a Compute Cloud client should be looking into, for the simple reason that some Webmasters, having more productive things to do than chase down Compute Cloud users all day long, simply ban all access from the Cloud in order to protect their intellectual property and server bandwidth allocation. It is far easier to simply deny access to a large range of IP addresses than it is to spend hours every month issuing DMCA complaints. While the vast majority of sites won't take this action, I think you'll find that many small- and mid-sized sites with technically-astute Webmasters will -- and they tend to be sites with unique and valuable content that is worth protecting.
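The "ban the whole Cloud" approach described above amounts to membership tests against a handful of CIDR blocks. A minimal sketch follows; the network ranges here are documentation placeholders (RFC 5737 test ranges), not Amazon's actual published blocks, which a real deployment would have to obtain and keep current.

```python
import ipaddress

# Placeholder "cloud" ranges for illustration only -- not real EC2 blocks.
BANNED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]

def is_banned(ip_str):
    """True if the client IP falls inside any banned network block."""
    ip = ipaddress.ip_address(ip_str)
    return any(ip in net for net in BANNED_NETWORKS)
```

Denying a few large ranges this way is the cheap alternative Jim describes to chasing down individual Cloud tenants or filing DMCA complaints one at a time.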
I'll leave the worthiness of your service for others to judge. Due largely to frustration from dealing with constant abuse of their sites, members here can appear a bit harsh about such things; I find it more productive to simply assess them in a cool, detached manner. You can disregard the emotions, but please heed the practical and technical aspects of the replies in this thread.
In part, you've also inherited a taint from Nutch's reputation. Despite warnings that I gave them here, they took no steps to require their users to include any organization-identifying information in the user-agent string sent by Nutch. This would have been a simple thing: simply refuse to spider until an rDNS-verifiable Web page and/or e-mail address was inserted into the user-agent string as a configuration step, but they didn't do it. As a result, we have a very capable spider loose on the Web, acting at the behest of persons unknown, and downloading our pages for purposes unknown. They may be legitimate, they may be stealing our content so they can slap ads all over it, re-publish it, and then compete with us for search engine ranking while trading on our names and reputations, or they may simply be using our server bandwidth for purposes which will in no way benefit us. We don't know, so it's much easier to just deny access to requests from Nutch.
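The configuration gate described above -- refuse to spider until organization-identifying information appears in the user-agent string -- could be as simple as checking the UA for a contact URL or e-mail address before the crawl starts. This is a hypothetical sketch of that check, not anything Nutch actually ships; the patterns are deliberately loose.

```python
import re

# Loose, illustrative patterns for a contact URL or e-mail address.
URL_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def may_start_crawl(user_agent):
    """Allow the crawl only if the UA string identifies its operator."""
    return bool(URL_RE.search(user_agent) or EMAIL_RE.search(user_agent))
```

An anonymous UA like "Nutch/0.9" would be rejected at startup, while one carrying a bot-info URL or contact address would pass.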
After suffering constant abuse by content scrapers, e-mail address harvesters, log-spammers, and SQL-injection, WordPress, forum, and PHP hackers, you'll find that some Webmasters are frustrated, angry, and a bit paranoid. Their feelings are not unjustified.
Best,
Jim
In any case, I know you aren't a direct customer, but it appears there are other sites that are associated with, linked with, or connected to your site, thus the crawl. Apologies if it's unwelcome -- the costly alternative was having a human hand-review every single site (and its associates) requesting to use our free product. I have a publisher community to protect.
Don,
I did get the joke, and wish I could play along... since our customers are publishers and this is a popular publisher hangout that is in public view, this constitutes a written "official response" and so I have to put that rather dry hat on. TBH, we did spend some time thinking about naming the UA, but after doing a few test runs and seeing a variety of cloaking, this was the path of least resistance.
Jim,
Thanks for your detailed reply. We'll look into our alternatives again to see if we can build confidence around our crawl. AWS also seems to be an e-mail spam haven, and having been a system administrator before, I can understand the broad blocklist. I'll take a look around Webmaster World to see what other sorts of things are coming from AWS, and I'll discuss it with our Amazon account manager.
I don't have a good solution in mind for the point you raise about Nutch. If you have a patch, we could get it submitted, or I could write one to submit. As a fellow publisher of open source software, I know there's a real tradeoff between easy-to-use and secure. If we switched to curl or Heritrix, I don't think it would really change much. Maybe I need to hold another board meeting to come up with a creative UA string for us :)
Thanks for the warning about harshness -- having worked in the digital marketing business for many years, I have some pretty thick skin...
Mike