Any help would be appreciated.
Cheers!
meidong.comrite.com:69.248.26.83
CustName: Comcast Cable Communications, Inc
Address: 1800 Bishops Gate Boulevard
City: Mount Laurel
StateProv: NJ
PostalCode: 08054
Country: US
RegDate: 2004-12-07
Updated: 2004-12-07
NetRange: 69.248.0.0 - 69.248.255.255
I can't read their Chinese site, but I see no reason why they would send out their own crawler. Maybe it's logfile spamming.
"Best Chinese Search Engine for Oversea Web Sites"
Or so sayeth the English linked-version of their Chinese site [comrite.com].
Aside...
What is the deal with everybody jumping on the Nutch bandwagon these days? I'm seeing this bot more and more and MORE, at least 20-30 times/day from different hosts now, and it always ignores robots.txt. (grumble,grumble)
Thank God -- and the Gods of Apache.org (ironically enough), and Ralf Engelschall -- for mod_rewrite
What is the deal with everybody jumping on the Nutch bandwagon these days?
From [lucene.apache.org...]
Nutch is open source web-search software. It builds on Lucene Java, adding web-specifics, such as a crawler, a link-graph database, parsers for HTML and other document formats, etc.
Why reinvent the wheel?
Also, from [lucene.apache.org...] :
If you're reading this, chances are you've seen a Nutch-based robot visiting your site while looking through your server logs. Our software obeys robots.txt files and robot META tags in HTML. These are the standard mechanisms for webmasters to tell web robots which portions of a site a robot is welcome to access.
That being said, Nutch is open source, and as such, it would be possible to disable robots.txt compliance, or even make evil use of robots.txt directives, such as accessing disallowed files and directories on purpose.
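For anyone wondering what "obeys robots.txt" amounts to in practice, here is a minimal sketch using Python's standard-library parser. It is not Nutch's own code (Nutch is Java); the robots.txt rules, the `is_allowed` helper, and the example.com URLs are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body; the rules below are illustrative only.
ROBOTS_TXT = """\
User-agent: Nutch
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("Nutch", "SomeBot"):
        for url in ("http://example.com/page.html",
                    "http://example.com/private/x"):
            print(agent, url, is_allowed(ROBOTS_TXT, agent, url))
```

A polite crawler runs a check like this before every request and skips anything that comes back False. Removing that one check from an open-source crawler is all it takes to stop complying.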
I'm just surprised at how quickly Nutch is proliferating. Almost as if someone gave out free DVDs at some big geeky gathering. In no particular order, here are just a few of the scoff-at-robots.txt* guilty. Many are major, old-time crawlers, but a lot appear to be on their own:
.cs.washington.edu
.watson.ibm.com
.looksmart.com
.ee.ucla.edu
.simpy.com
.pureserver.info
.cs.titech.ac.jp
.serverkompetenz.net
.riesentoter.com
customer-reverse-entry.216.93.185.12 (ma.gnolia)
66.29.XX.XX (Net Access Corporation, NJ)
66.243.XX.XX (a business center in TX)
193.203.XXX.XXX (Aviators Network, UK)
slimy.vhosting.com
71-35-XXX-XX.tukw.qwest.net
164.124.XXX.XX.cfl.res.rr.com
c-24-99-XX-XXX.hsd1.ga.comcast.net
c-67-168-XXX-XX.hsd1.wa.comcast.net
c-69-248-XX-XX.hsd1.nj.comcast.net
Those guys are using these versions -- 0.05, 0.7.1, 0.7.2, 0.8-dev (@lists.sourceforge.net; @lucene.apache.org; @cs.washington.edu) -- and/or these spawn:
Argus/1.1 (Nutch; [simpy.com...] feedback at simpy dot com)
BurstFind Crawler 1.0/0.7.1 (Nutch; [lucene.apache.org...] crawler@burstfind.com)
Comrite/0.7.1 (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)
Misterbot-Nutch/0.7.1 (Misterbot-Nutch; [misterbot.fr;...] admin@misterbot.fr)
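Since the spawn all keep "Nutch" somewhere in the agent string, spotting them is simple. Below is a rough Python sketch of the kind of user-agent test a rewrite rule would apply. The function names `parse_nutch_agent` and `status_for` are hypothetical, and the sample strings are copied from the list above with the forum's truncated links left as they appear.

```python
import re
from typing import Optional, Tuple

# Product/version token followed by a parenthesised comment, e.g.
# "Comrite/0.7.1 (Nutch; ...)". This is a loose, illustrative pattern only.
AGENT_PATTERN = re.compile(
    r"^(?P<product>[^/]+)/(?P<version>\S+)\s+\((?P<comment>[^)]*)\)"
)


def parse_nutch_agent(user_agent: str) -> Optional[Tuple[str, str]]:
    """Return (product, version) if the agent looks Nutch-based, else None."""
    match = AGENT_PATTERN.match(user_agent)
    if match is None:
        return None
    # Treat the agent as Nutch-derived if "nutch" appears anywhere in the
    # product token or in the comment.
    haystack = match.group("product") + " " + match.group("comment")
    if "nutch" not in haystack.lower():
        return None
    return match.group("product"), match.group("version")


def status_for(user_agent: str) -> int:
    """Hypothetical policy: 403 for Nutch-derived agents, 200 otherwise."""
    return 403 if parse_nutch_agent(user_agent) else 200


if __name__ == "__main__":
    samples = [
        "Argus/1.1 (Nutch; [simpy.com...] feedback at simpy dot com)",
        "Comrite/0.7.1 (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    ]
    for ua in samples:
        print(status_for(ua), parse_nutch_agent(ua))
```

This only catches operators who leave the Nutch name in place. Anyone who edits the agent string completely would slip past a name-based test.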
Thing is, ALL of them are also completely undeterred by 403s. Hits over and over and over and OVER again, to scores of different pages. All for their own purposes, NONE of which are either apparent or useful to me.
Okay. "Rude Nutch Users" rant over:)
*Requesting pages the exact same second you request robots.txt is cheating.
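If you want to catch the same-second cheaters in your own logs, something along these lines might do. It is only a sketch: it assumes the log has already been parsed into (host, timestamp, path) records with one-second timestamps, and the `Hit` type, the `same_second_cheaters` name, and the sample hosts are all invented for the example.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple, Set


class Hit(NamedTuple):
    host: str
    timestamp: str  # one-second resolution, e.g. "2006-03-01 12:00:05"
    path: str


def same_second_cheaters(hits: Iterable[Hit]) -> Set[str]:
    """Hosts that requested a page in the same second as /robots.txt."""
    robots_seconds = defaultdict(set)  # host -> seconds with a robots.txt hit
    page_seconds = defaultdict(set)    # host -> seconds with any other hit
    for hit in hits:
        bucket = robots_seconds if hit.path == "/robots.txt" else page_seconds
        bucket[hit.host].add(hit.timestamp)
    return {
        host
        for host, seconds in robots_seconds.items()
        if seconds & page_seconds.get(host, set())
    }


if __name__ == "__main__":
    sample = [
        Hit("rude.example", "2006-03-01 12:00:05", "/robots.txt"),
        Hit("rude.example", "2006-03-01 12:00:05", "/private/page.html"),
        Hit("polite.example", "2006-03-01 12:01:00", "/robots.txt"),
        Hit("polite.example", "2006-03-01 12:01:09", "/index.html"),
    ]
    print(same_second_cheaters(sample))
```

A same-second match doesn't prove the bot ignored the file, since a fast crawler could read it and fetch an allowed page within the same second, so treat the result as a list of suspects to check against your Disallow rules.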