
Comrite/0.7.1

Comrite robot identification query

         

fusion5

9:01 pm on Mar 31, 2006 (gmt 0)

10+ Year Member



Anybody know what this thing is, who they are?
Full UA was: Comrite/0.7.1 (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)
They were coming from: 69.248.26.83

Any help would be appreciated.

Cheers!

HandwovenRug

9:26 pm on Mar 31, 2006 (gmt 0)

10+ Year Member



What about this:

meidong.comrite.com:69.248.26.83

CustName: Comcast Cable Communications, Inc
Address: 1800 Bishops Gate Boulevard
City: Mount Laurel
StateProv: NJ
PostalCode: 08054
Country: US
RegDate: 2004-12-07
Updated: 2004-12-07

NetRange: 69.248.0.0 - 69.248.255.255

I can't read their Chinese site, but I see no reason why they would send out a crawler of their own. Maybe it's logfile spamming.

Pfui

11:26 pm on Mar 31, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Among other things, Comrite claims to be the --

"Best Chinese Search Engine for Oversea Web Sites"

Or so sayeth the English linked-version of their Chinese site [comrite.com].

Aside...

What is the deal with everybody jumping on the Nutch bandwagon these days? I'm seeing this bot more and more and MORE, at least 20-30 times/day from different hosts now, and it always ignores robots.txt. (grumble,grumble)

Thank God -- and the Gods of Apache.org (ironically enough), and Ralf Engelschall -- for mod_rewrite.
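For anyone wanting to do the same, here's a minimal sketch of the sort of mod_rewrite rule being alluded to. The User-Agent tokens are just examples pulled from this thread; adjust the pattern to whatever shows up in your own logs:

```apache
# Refuse any client whose User-Agent mentions Nutch or a known Nutch spawn.
# Goes in httpd.conf or .htaccess (mod_rewrite must be enabled).
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (Nutch|Comrite|Misterbot) [NC]
RewriteRule .* - [F,L]
```

The [F] flag sends a 403 Forbidden; [NC] makes the match case-insensitive.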

volatilegx

8:51 pm on Apr 3, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



What is the deal with everybody jumping on the Nutch bandwagon these days?

From [lucene.apache.org...]

Nutch is open source web-search software. It builds on Lucene Java, adding web-specifics, such as a crawler, a link-graph database, parsers for HTML and other document formats, etc.

Why reinvent the wheel?

Also, from [lucene.apache.org...] :

If you're reading this, chances are you've seen a Nutch-based robot visiting your site while looking through your server logs. Our software obeys robots.txt files and robot META tags in HTML. These are the standard mechanisms for webmasters to tell web robots which portions of a site a robot is welcome to access.

That being said, Nutch is open source, and as such, it would be possible to disable robots.txt compliance, or even make evil use of robots.txt directives, such as deliberately accessing disallowed files and directories.
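On the flip side, a *compliant* Nutch instance can be shut out with plain robots.txt. You can check what a well-behaved crawler would conclude using Python's standard urllib.robotparser; the agent name "Comrite" below is taken from the UA in this thread, and the rules themselves are just an illustration:

```python
# Sketch: a robots.txt that blocks one Nutch-based agent, and what a
# compliant crawler would decide when checking it.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: Comrite
Disallow: /

User-agent: *
Allow: /
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# The parser matches on the token before the "/" in the UA string.
print(rp.can_fetch("Comrite/0.7.1", "http://example.com/page.html"))  # False
print(rp.can_fetch("Mozilla/5.0", "http://example.com/page.html"))    # True
```

Of course, as noted, nothing forces a modified Nutch to run this check at all.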

Pfui

11:33 pm on Apr 3, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Thanks for the info, volatilegx. Actually, I knew its background:) That's why I found it ironic that Apache's mod_rewrite is my solution to deflecting Apache's Nutch (...and Apache's Jakarta).

I'm just surprised at how quickly Nutch is proliferating. Almost as if someone gave out free DVDs at some big geeky gathering. In no particular order, here are just a few of the scoff-at-robots.txt* guilty. Many are major, old-time crawlers, but a lot appear to be on their own:

.cs.washington.edu
.watson.ibm.com
.looksmart.com
.ee.ucla.edu
.simpy.com
.pureserver.info
.cs.titech.ac.jp
.serverkompetenz.net
.riesentoter.com
customer-reverse-entry.216.93.185.12 (ma.gnolia)
66.29.XX.XX (Net Access Corporation, NJ)
66.243.XX.XX (a business center in TX)
193.203.XXX.XXX (Aviators Network, UK)
slimy.vhosting.com
71-35-XXX-XX.tukw.qwest.net
164.124.XXX.XX.cfl.res.rr.com
c-24-99-XX-XXX.hsd1.ga.comcast.net
c-67-168-XXX-XX.hsd1.wa.comcast.net
c-69-248-XX-XX.hsd1.nj.comcast.net

Those guys are using these versions -- 0.05, 0.7.1, 0.7.2, 0.8-dev (@lists.sourceforge.net; @lucene.apache.org; @cs.washington.edu) -- and/or these spawn:

Argus/1.1 (Nutch; [simpy.com...] feedback at simpy dot com)
BurstFind Crawler 1.0/0.7.1 (Nutch; [lucene.apache.org...] crawler@burstfind.com)
Comrite/0.7.1 (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)
Misterbot-Nutch/0.7.1 (Misterbot-Nutch; [misterbot.fr;...] admin@misterbot.fr)
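All of those strings keep the word "Nutch" somewhere in the UA, even the renamed spawn, so a single word-boundary match in a log-analysis script catches the whole family. A sketch (the sample UAs are abbreviated versions of the ones quoted above, not complete strings):

```python
import re

# "Nutch" survives in every Nutch-derived User-Agent quoted above, even in
# renamed spawn like Misterbot-Nutch, so one word-boundary pattern covers
# the whole family.
NUTCH_UA = re.compile(r"\bNutch\b")

samples = [
    "Comrite/0.7.1 (Nutch; ...)",                     # abbreviated
    "Misterbot-Nutch/0.7.1 (Misterbot-Nutch; ...)",   # abbreviated
    "Mozilla/4.0 (compatible; MSIE 6.0)",             # ordinary browser
]

flagged = [ua for ua in samples if NUTCH_UA.search(ua)]
print(flagged)  # the two Nutch-based samples
```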

Thing is, ALL of them are also completely undeterred by 403s. Hits over and over and over and OVER again, to scores of different pages. All for their own purposes, NONE of which are either apparent or useful to me.

Okay. "Rude Nutch Users" rant over:)

*Requesting pages the exact same second you request robots.txt is cheating.