crawler4j

         

incrediBILL

7:33 am on Oct 29, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Didn't request robots.txt.

Just hit the home page and went away.

"crawler4j (http://code.google.com/p/crawler4j/)"

Someone tried using it from an Italian university, unisi.it, from IP 193.205.7.*

Dijkgraaf

12:15 am on Nov 1, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Yes, the issues forum for that crawler says, "No, the current version does not support robots.txt."
I've added a note saying that this will get it banned by a lot of webmasters.

incrediBILL

2:55 am on Nov 1, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



They shouldn't put it on the web before it honors robots.txt in the first place.

Just goes to show they're bad neighbors already.
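
For reference, the check being asked for is small. Below is a minimal sketch of a robots.txt test using Python's standard urllib.robotparser. This is not crawler4j's code or API; the helper name allowed(), the sample rules, and the example.com URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a webmaster might serve; not taken from any real site.
SAMPLE_ROBOTS = """\
User-agent: crawler4j
Disallow: /

User-agent: *
Disallow: /private/
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines; a real crawler would download
    # /robots.txt from the target host before requesting anything else.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # The bot named in the rules is shut out of the whole site.
    print(allowed(SAMPLE_ROBOTS, "crawler4j", "http://example.com/"))
    # Any other bot falls under the wildcard rules.
    print(allowed(SAMPLE_ROBOTS, "SomeOtherBot", "http://example.com/index.html"))
    print(allowed(SAMPLE_ROBOTS, "SomeOtherBot", "http://example.com/private/x"))
```

A well-behaved crawler would run a check like this before every request and skip any URL for which it returns False.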

Pfui

5:46 am on Nov 9, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Earlier today, from the respected Rensselaer Polytechnic Institute (coincidentally where, in 1865, an ancestor by marriage, John Flack Winslow, served as fifth president), came the disrespectfully coded and run "crawler4j" bot:

leo.tw.rpi.edu
crawler4j (http://code.google.com/p/crawler4j/)
robots.txt? NO

2 hits in 1 second.

(Where else can you get bot bits and family tree trivia in one byte?:)