Forum Moderators: open
LOG:
This Month --
mail.visvo.com - - [05/Jun/2006:22:40:56 -0700] "GET /robots.txt HTTP/1.0"
"NutchCVS/0.8-dev (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)"
mail.visvo.com - - [08/Jun/2006:20:26:56 -0700] "GET /robots.txt HTTP/1.0"
"Anonymous/0.0 (Anonymous; [anonymous.com;...] noreply@anonymous.com)"
Last Month --
mail.visvo.com - - [18/May/2006:23:43:23 -0700] "GET /robots.txt HTTP/1.0"
"Skywalker/0.1 (Skywalker; anonymous; anonymous)"
NOTES:
From dnsstuff.com (excerpted):
IP address: 63.133.162.98
Reverse DNS: mail.visvo.com
Reverse DNS authenticity: [Verified]
IncrediBILL's blog (URL in his profile [webmasterworld.com]) shows the same IP (06/08/2006).
Visvo.com [whois.domaintools.com] is a placeholder search page registered to -- Yahoo/Inktomi.
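The [Verified] flag above reflects forward-confirmed reverse DNS: the hostname returned by the PTR lookup must resolve back to the original IP. A minimal sketch of that check (the resolver functions are injectable assumptions so it can be exercised without live DNS):

```python
import socket

def verify_reverse_dns(ip, resolve_ptr=None, resolve_a=None):
    """Forward-confirmed reverse DNS: the PTR hostname must resolve
    back to the original IP for the reverse DNS to count as authentic."""
    # Default to real DNS lookups; injectable for offline testing.
    resolve_ptr = resolve_ptr or (lambda addr: socket.gethostbyaddr(addr)[0])
    resolve_a = resolve_a or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        hostname = resolve_ptr(ip)        # PTR lookup: IP -> hostname
        return ip in resolve_a(hostname)  # A lookup: hostname -> IP list
    except OSError:
        return False
```

This is the same test tools like dnsstuff.com perform; a crawler whose reverse DNS fails it should be treated with extra suspicion.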
IMHO:
Thumbs down.
It doesn't matter to me that the bots from "mail.visvo.com" requested robots.txt. The host is clearly relentless, and two of the three bots are too precious about being 'anonymous', so I don't want to play their (or Yahoo/Inktomi's?) hide-the-ID game.
05/18/2006 63.133.162.98 "Skywalker/0.1 (Skywalker; anonymous; anonymous)"
06/02/2006 63.133.162.98 "NutchCVS/0.8-dev (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)"
06/08/2006 63.133.162.98 "Anonymous/0.0 (Anonymous; [anonymous.com;...] noreply@anonymous.com)"
I hadn't noticed that they were experimenting with different bots, so I wonder whether the last incarnation was just Nutch with the user agent finally changed once they noticed everyone was blocking Nutch.
Things like this led me to blocking entire data centers. The next obvious step, which I've documented in a couple of crawlers as they evolved, will be for them to switch the user agent to "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; )" when they notice they can't get what they need any other way.
The reason you are seeing mail.visvo.com is that everything goes out through a single router, whose IP has reverse DNS configured for the mail servers that require it.
We were trying to get a full crawl of the dmoz directory, which we have now completed, so you shouldn't see much crawling activity from us (for the next few weeks, anyway). But send me your URL if you don't want us crawling your site and I will make sure to remove it from our crawl list.
We were trying to get a full crawl of the dmoz directory which we have now completed
Sorry to disappoint, but you didn't get a full crawl ;)
My logs show I bounced it off the home page because of a BAD_AGENT match 10 distinct times in a month, under 3 different user-agent names.
Persistent if nothing else.
05/18/2006 BAD_AGENT 63.133.162.98 "Skywalker/0.1 (Skywalker; anonymous; anonymous)" "/"
05/19/2006 BAD_AGENT 63.133.162.98 "Skywalker/0.1 (Skywalker; anonymous; anonymous)" "/"
05/21/2006 BAD_AGENT 63.133.162.98 "Skywalker/0.1 (Skywalker; anonymous; anonymous)" "/"
06/02/2006 BAD_AGENT 63.133.162.98 "NutchCVS/0.8-dev (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)" "/"
06/02/2006 BAD_AGENT 63.133.162.98 "NutchCVS/0.8-dev (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)" "/"
06/05/2006 BAD_AGENT 63.133.162.98 "NutchCVS/0.8-dev (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)" "/"
06/05/2006 BAD_AGENT 63.133.162.98 "NutchCVS/0.8-dev (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)" "/"
06/08/2006 BAD_AGENT 63.133.162.98 "Anonymous/0.0 (Anonymous; [anonymous.com;...] noreply@anonymous.com)" "/"
06/08/2006 BAD_AGENT 63.133.162.98 "Anonymous/0.0 (Anonymous; [anonymous.com;...] noreply@anonymous.com)" "/"
06/10/2006 BAD_AGENT 63.133.162.98 "Anonymous/0.0 (Anonymous; [anonymous.com;...] noreply@anonymous.com)" "/"
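For reference, a bounce like the one logged above can be done with a few mod_rewrite rules. This is only a hedged sketch, assuming Apache with mod_rewrite enabled; the patterns cover just the three agents seen in this thread:

```apache
# Illustrative .htaccess fragment: refuse (403) the three user agents
# observed from 63.133.162.98. Patterns are examples, not a complete list.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^NutchCVS  [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Skywalker [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Anonymous [NC]
RewriteRule .* - [F]
```

Blocking by IP or netblock (e.g. `Deny from 63.133.162.98`) is the sturdier option once a host starts rotating user-agent names.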
If you can't take the time to whip out a quick web page telling people why you're crawling their site, then perhaps you shouldn't be crawling it; it's kinda rude, IMO.
That's very bad in itself, and a bad excuse as well. If you want to access people's sites, at least do them the courtesy of identifying yourself properly, putting up a Web page [webmasterworld.com] telling us what you're doing, and letting us decide whether we want you scraping our content and using our bandwidth. When you're working on a funded project, it's easy to forget that Webmasters pay for bandwidth and may need to make a cost/benefit analysis on the use of that bandwidth.
The alternative is to get listed on public 'nuisance' lists [webmasterworld.com], and have your user-agents, IP addresses, hostnames or even your entire IP address block banned.
Look at Nutch. More than a year ago, I suggested (here) to a Nutch developer that they require users to identify themselves as part of their Terms of Use. Apparently, they blew that off. Now 'Nutch' is considered to be a nuisance 'bot [webmasterworld.com] by many Webmasters, and that's really too bad.
Suggestions? Yes: use a standard user-agent string [mozilla.org], something like:
Mozilla/4.0 (compatible; musepbot/0.1; devteam1/0.1; http://www.example.com/robot/musepbot.html)
Mozilla/4.0 (compatible; musepbot/0.1; devteam2/0.2a; http://www.example.com/robot/musepbot.html)
Mozilla/4.0 (compatible; musepbot/0.2; devteam3/0.1; http://www.example.com/robot/musepbot.html)
with the "devteam" field identifying the deployment team and its current version, while "musepbot/0.1" might track the current Nutch revision. The point is that you don't need entirely different or obscured user-agent names for any reason, and that making the trunk/branch structure visible in the user-agent string allows specific project identification (some of us might even be glad to help by reporting problems, but only if you make it possible).
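The scheme above can be sketched in a few lines. This is a hedged example, not anyone's actual implementation: "musepbot", "devteam1", and the example.com URL are the hypothetical names from the suggestion, and the robots.txt check is the minimum courtesy a polite bot owes before fetching:

```python
import urllib.robotparser

BOT_NAME = "musepbot"        # hypothetical bot name from the example above
BOT_VERSION = "0.1"          # might track the crawler trunk revision
TEAM = "devteam1/0.1"        # identifies the deploying team and its version
INFO_URL = "http://www.example.com/robot/musepbot.html"

def build_user_agent():
    """Compose a standard, self-identifying user-agent string."""
    return (f"Mozilla/4.0 (compatible; {BOT_NAME}/{BOT_VERSION}; "
            f"{TEAM}; {INFO_URL})")

def allowed_to_fetch(robots_txt_lines, url, user_agent=BOT_NAME):
    """Check a site's robots.txt rules before crawling a URL."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_txt_lines)
    return rp.can_fetch(user_agent, url)

# A site that disallows everything for musepbot:
rules = ["User-agent: musepbot", "Disallow: /"]
print(build_user_agent())
print(allowed_to_fetch(rules, "http://www.example.com/page.html"))  # False
```

With a string like this in the logs, a Webmaster can tell at a glance which project and which deployment team came calling, and where to read about (or complain about) the bot.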
The days of turning a bot loose and naively believing that "it's OK" are long over. There are a lot of malicious denizens of the Web, and you need to be careful not to be classed among them.
The 'send us an e-mail to be taken off the crawl list' approach is not viable; it doesn't scale at all, and why would a Webmaster want to 'give' his or her e-mail address to an unknown entity whose activity might very well look like an e-mail address scraper's? Use the Web page approach; it's easier for everyone.
Jim
The days of turning a bot loose and naively believing that "it's OK" are long over
Sadly, that's not true, although I wish it were.
There are a lot of web sites out there that don't even have a robots.txt or .htaccess file, and whose webmasters don't even know what those files are.
Server control panels let webmasters configure almost every other aspect of their hosting accounts, including the firewall, yet they lack support for these basic internet standards, support that would raise awareness and help control these problems.
Oh well, in a perfect world...
Is there anything else that we should do to make it easier for webmasters?