Forum Moderators: open
LOG:
This Month --
mail.visvo.com - - [05/Jun/2006:22:40:56 -0700] "GET /robots.txt HTTP/1.0"
"NutchCVS/0.8-dev (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)"
mail.visvo.com - - [08/Jun/2006:20:26:56 -0700] "GET /robots.txt HTTP/1.0"
"Anonymous/0.0 (Anonymous; [anonymous.com;...] noreply@anonymous.com)"
Last Month --
mail.visvo.com - - [18/May/2006:23:43:23 -0700] "GET /robots.txt HTTP/1.0"
"Skywalker/0.1 (Skywalker; anonymous; anonymous)"
NOTES:
From dnsstuff.com (excerpted):
IP address: 63.133.162.98
Reverse DNS: mail.visvo.com
Reverse DNS authenticity: [Verified]
IncrediBILL's blog (URL in his profile [webmasterworld.com]) shows the same IP (06/08/2006).
Visvo.com [whois.domaintools.com] is a placeholder search page registered to -- Yahoo/Inktomi.
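The [Verified] flag above reflects forward-confirmed reverse DNS: the hostname returned by the PTR lookup must resolve back to the original IP. A minimal sketch of that check (the resolver functions are injectable assumptions so it can be exercised without live DNS):

```python
import socket

def verify_reverse_dns(ip, resolve_ptr=None, resolve_a=None):
    """Forward-confirmed reverse DNS: the PTR hostname must resolve
    back to the original IP for the reverse DNS to count as authentic."""
    # Default to real DNS lookups; injectable for offline testing.
    resolve_ptr = resolve_ptr or (lambda addr: socket.gethostbyaddr(addr)[0])
    resolve_a = resolve_a or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        hostname = resolve_ptr(ip)        # PTR lookup: IP -> hostname
        return ip in resolve_a(hostname)  # A lookup: hostname -> IP list
    except OSError:
        return False
```

This is the same test tools like dnsstuff.com perform; a crawler whose reverse DNS fails it should be treated with extra suspicion.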
IMHO:
Thumbs down.
It doesn't matter to me that the bots from "mail.visvo.com" requested robots.txt. The host is clearly relentless, and two of the three bots are too precious about being 'anonymous', so I don't want to play their (or Yahoo/Inktomi's?) hide-the-ID game.
05/18/2006 63.133.162.98 "Skywalker/0.1 (Skywalker; anonymous; anonymous)"
06/02/2006 63.133.162.98 "NutchCVS/0.8-dev (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)"
06/08/2006 63.133.162.98 "Anonymous/0.0 (Anonymous; [anonymous.com;...] noreply@anonymous.com)"
I hadn't noticed that they were experimenting with different bots, so I wonder whether the last incarnation was just Nutch with the user agent finally changed once they noticed everyone was blocking Nutch.
Things like this led me to blocking entire data centers. The next obvious step, which I've documented in a couple of crawlers as they evolved, will be for them to switch the user agent to "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; )" when they notice they can't get what they need any other way.
The reason you are seeing mail.visvo.com is that everything goes out through a single router, whose IP has reverse DNS configured for the mail servers that require it.
We were trying to get a full crawl of the dmoz directory, which we have now completed, so you shouldn't see much crawling activity from us (for the next few weeks, anyway). But send me your URL if you don't want us crawling your site and I will make sure to remove it from our crawl list.
We were trying to get a full crawl of the dmoz directory which we have now completed
Sorry to disappoint, but you didn't get a full crawl ;)
My logs show I bounced it off the home page because of a BAD_AGENT match 10 distinct times in a month, under 3 different user-agent names.
Persistent if nothing else.
05/18/2006 BAD_AGENT 63.133.162.98 "Skywalker/0.1 (Skywalker; anonymous; anonymous)" "/"
05/19/2006 BAD_AGENT 63.133.162.98 "Skywalker/0.1 (Skywalker; anonymous; anonymous)" "/"
05/21/2006 BAD_AGENT 63.133.162.98 "Skywalker/0.1 (Skywalker; anonymous; anonymous)" "/"
06/02/2006 BAD_AGENT 63.133.162.98 "NutchCVS/0.8-dev (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)" "/"
06/02/2006 BAD_AGENT 63.133.162.98 "NutchCVS/0.8-dev (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)" "/"
06/05/2006 BAD_AGENT 63.133.162.98 "NutchCVS/0.8-dev (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)" "/"
06/05/2006 BAD_AGENT 63.133.162.98 "NutchCVS/0.8-dev (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)" "/"
06/08/2006 BAD_AGENT 63.133.162.98 "Anonymous/0.0 (Anonymous; [anonymous.com;...] noreply@anonymous.com)" "/"
06/08/2006 BAD_AGENT 63.133.162.98 "Anonymous/0.0 (Anonymous; [anonymous.com;...] noreply@anonymous.com)" "/"
06/10/2006 BAD_AGENT 63.133.162.98 "Anonymous/0.0 (Anonymous; [anonymous.com;...] noreply@anonymous.com)" "/"
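For reference, a bounce like the one logged above can be done with a few mod_rewrite rules. This is only a hedged sketch, assuming Apache with mod_rewrite enabled; the patterns cover just the three agents seen in this thread:

```apache
# Illustrative .htaccess fragment: refuse (403) the three user agents
# observed from 63.133.162.98. Patterns are examples, not a complete list.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^NutchCVS  [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Skywalker [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Anonymous [NC]
RewriteRule .* - [F]
```

Blocking by IP or netblock (e.g. `Deny from 63.133.162.98`) is the sturdier option once a host starts rotating user-agent names.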
If you can't take the time to whip out a quick web page telling people why you're crawling their site, then perhaps you shouldn't be crawling it; it's kinda rude, IMO.
That's very bad in itself, and a bad excuse as well. If you want to access people's sites, at least do them the courtesy of identifying yourself properly, putting up a Web page [webmasterworld.com] telling us what you're doing, and letting us decide whether we want you scraping our content and using our bandwidth. When you're working on a funded project, it's easy to forget that Webmasters pay for bandwidth and may need to make a cost/benefit analysis on the use of that bandwidth.
The alternative is to get listed on public 'nuisance' lists [webmasterworld.com], and have your user-agents, IP addresses, hostnames or even your entire IP address block banned.
Look at Nutch. More than a year ago, I suggested (here) to a Nutch developer that they require users to identify themselves as part of their Terms of Use. Apparently, they blew that off. Now 'Nutch' is considered to be a nuisance 'bot [webmasterworld.com] by many Webmasters, and that's really too bad.
Suggestions? Yes: use a standard user-agent string [mozilla.org], something like:
Mozilla/4.0 (compatible; musepbot/0.1; devteam1/0.1; http://www.example.com/robot/musepbot.html)
Mozilla/4.0 (compatible; musepbot/0.1; devteam2/0.2a; http://www.example.com/robot/musepbot.html)
Mozilla/4.0 (compatible; musepbot/0.2; devteam3/0.1; http://www.example.com/robot/musepbot.html)
with the "devteam" field identifying the deployment team and its current version, while "musepbot/0.1" might track the current Nutch revision. The point is that you don't need entirely different or obscured user-agent names for any reason, and that making the trunk/branch structure visible in the user-agent string allows specific project identification (some of us might even be glad to help by reporting problems, but only if you make it possible).
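The scheme above can be sketched in a few lines. This is a hedged example, not anyone's actual implementation: "musepbot", "devteam1", and the example.com URL are the hypothetical names from the suggestion, and the robots.txt check is the minimum courtesy a polite bot owes before fetching:

```python
import urllib.robotparser

BOT_NAME = "musepbot"        # hypothetical bot name from the example above
BOT_VERSION = "0.1"          # might track the crawler trunk revision
TEAM = "devteam1/0.1"        # identifies the deploying team and its version
INFO_URL = "http://www.example.com/robot/musepbot.html"

def build_user_agent():
    """Compose a standard, self-identifying user-agent string."""
    return (f"Mozilla/4.0 (compatible; {BOT_NAME}/{BOT_VERSION}; "
            f"{TEAM}; {INFO_URL})")

def allowed_to_fetch(robots_txt_lines, url, user_agent=BOT_NAME):
    """Check a site's robots.txt rules before crawling a URL."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_txt_lines)
    return rp.can_fetch(user_agent, url)

# A site that disallows everything for musepbot:
rules = ["User-agent: musepbot", "Disallow: /"]
print(build_user_agent())
print(allowed_to_fetch(rules, "http://www.example.com/page.html"))  # False
```

With a string like this in the logs, a Webmaster can tell at a glance which project and which deployment team came calling, and where to read about (or complain about) the bot.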
The days of turning a bot loose and naively believing that "it's OK" are long over. There are a lot of malicious denizens of the Web, and you need to be careful not to be classed among them.
The 'send us an e-mail to be taken off the crawl list' approach is not viable; it doesn't scale at all, and why would a Webmaster want to 'give' his or her e-mail address to an unknown entity whose activity might very well look like an e-mail address scraper's? Use the Web page approach; it's easier for everyone.
Jim
The days of turning a bot loose and naively believing that "it's OK" are long over
Sadly, that's not true, although I wish it were.
There are a lot of web sites out there that don't even have a robots.txt or .htaccess file, and whose webmasters don't even know what those files are.
Server control panels let webmasters configure almost every other aspect of their hosting accounts, including the firewall, yet they lack support for these basic internet standards, support that would raise awareness and help control these problems.
Oh well, in a perfect world...
Is there anything else that we should do to make it easier for webmasters?