
MultiCrawler

Doesn't obey robots.txt


Mokita

2:58 pm on Jan 9, 2008 (gmt 0)

10+ Year Member



User Agent: multicrawler (+http://sw.deri.org/2006/04/multicrawler/robots.html)
Crawling from Irish IP: 140.203.154.nnn

This just hit one of our sites, requesting robots.txt and then the home page. The trouble is, the bot shouldn't have asked for anything after robots.txt, since that file disallows all crawlers:

User-agent: *
Disallow: /

From their info page:

MultiCrawler honors the Robots Exclusion Protocol.
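
For the record, any compliant parser reads that file as blocking everything. A minimal sketch using Python's urllib.robotparser (example.com stands in for the real site):

import urllib.robotparser

# feed the disallow-all rules straight into the parser
rp = urllib.robotparser.RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /"])

# a compliant bot must not fetch the home page
print(rp.can_fetch("multicrawler", "http://example.com/"))  # prints False

A bot that still fetches the home page after reading this file simply isn't honouring the protocol it claims to honour.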

wilderness

5:34 pm on Jan 9, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Mokita,
"crawler" is a catch phrase, as previously mentioned.

The only significant search engine that I recall using it is AltaVista.

Don

Mokita

2:57 am on Jan 10, 2008 (gmt 0)

10+ Year Member



Thanks Don - I don't think I have encountered a UA containing "crawler" previously, so I saw no need to load up .htaccess with what seemed to be an unnecessary rule. It's already groaning under the weight of the many CIDRs I block ;)

Also, I prefer to block IPs rather than user agents where possible, as bots often change UA to try to look like browsers.
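
For illustration, an IP-level block in .htaccess looks something like this (Apache 2.2 mod_authz_host syntax; the /16 around the IP reported above is just an example range, not a recommendation):

# deny the whole 140.203.x.x range, allow everyone else
Order Allow,Deny
Allow from all
Deny from 140.203.0.0/16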

wilderness

3:11 am on Jan 10, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Also, I prefer to block IPs rather than user agents where possible, as bots often change UA to try to look like browsers.

Mokita,
You may use both or a combination of both effectively.

For example: a bot crawling from a valid IP range belonging to a widely used provider, where denying the whole range would affect too many innocent visitors. Then a UA block, or a UA combined with an IP condition, is more focused.
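
A sketch of that combined rule in mod_rewrite terms (the "crawler" string and the IP prefix from the report at the top of the thread are illustrations only):

RewriteEngine On
# refuse requests only when the UA contains "crawler" AND the IP matches the range
RewriteCond %{HTTP_USER_AGENT} crawler [NC]
RewriteCond %{REMOTE_ADDR} ^140\.203\.
RewriteRule .* - [F]

Visitors from the same range with a normal browser UA, and "crawler" UAs from other ranges, both pass through untouched.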

Don

incrediBILL

3:38 am on Jan 10, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



You guys are overly harsh on this one.

It's an actual real semantic web research project running in a university.

Don't you want the next Google killer to escape the labs? ;)

Mokita

3:54 am on Jan 10, 2008 (gmt 0)

10+ Year Member



Don't you want the next Google killer to escape the labs?

Not until it has grown up and learnt to obey robots.txt :)

Mokita

4:31 am on Jan 10, 2008 (gmt 0)

10+ Year Member



wilderness wrote:
You may use both or a combination of both effectively.

Thanks Don, I already do - but with the emphasis on blocking IPs "where possible", meaning I don't block ranges used by ISPs, unless they happen to come from Asia or Eastern Europe.

wilderness

5:03 am on Jan 10, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



You guys are overly harsh on this one.

It's an actual real semantic web research project running in a university.

Don't you want the next Google killer to escape the labs? ;)

According to kiki, they are the 2nd coming ;)

It's a difficult task, in morals and personal preferences, to hold "many third party" orgs to a delicate balance between what benefits webmasters (and their sites) and those third parties' tendency to leave their body excretions on the webmaster's doorstep ;)

thetrasher

11:56 am on Jan 10, 2008 (gmt 0)

10+ Year Member



multicrawler obeys my robots.txt.
User-agent: *
Disallow: /

I'm using CR+LF.

Mokita

12:06 pm on Jan 10, 2008 (gmt 0)

10+ Year Member



thetrasher wrote
I'm using CR+LF.

Umm, please translate! I don't know what "CR+LF" means - and even if I did, surely:

User-agent: *
Disallow: /

means exactly the same thing to all supposedly compliant bots?

thetrasher

12:29 pm on Jan 10, 2008 (gmt 0)

10+ Year Member



I'm using CR+LF as the newline character sequence in robots.txt. There shouldn't be a difference between a single LF (= 0Ah) and CR+LF (= 0Dh followed by 0Ah), but ...

volatilegx

6:34 am on Jan 12, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Umm, please translate! I don't know what "CR+LF" means - and even if I did, surely:

Different operating systems use different characters to signify a new line.

I believe it's:
CR (carriage return)... \r
LF (line feed)... \n
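
If you want to check which convention your robots.txt actually uses, a quick Python sketch (reading the file as raw bytes):

# look for the two-byte CR+LF sequence before falling back to bare LF
with open("robots.txt", "rb") as f:
    data = f.read()
if b"\r\n" in data:
    print("CR+LF (0Dh 0Ah) line endings")
elif b"\n" in data:
    print("LF-only (0Ah) line endings")
else:
    print("no newline found")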

JuUm

3:02 pm on Feb 6, 2008 (gmt 0)

10+ Year Member



Hi guys

I am one of the developers of MultiCrawler.

Could you please give me the exact URL our crawler was visiting, so that we can check what the error was?

Thanks

Juergen