
MultiCrawler

Doesn't obey robots.txt


Mokita

2:58 pm on Jan 9, 2008 (gmt 0)

10+ Year Member



User Agent: multicrawler (+http://sw.deri.org/2006/04/multicrawler/robots.html)
Crawling from Irish IP: 140.203.154.nnn

This just hit one of our sites, requesting robots.txt and then the home page. The trouble is, the bot shouldn't have asked for anything after robots.txt, since that file disallows all crawlers:

User-agent: *
Disallow: /

From their info page:

MultiCrawler honors the Robots Exclusion Protocol.
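
For the record, any compliant parser reads that file as blocking everything. A minimal sketch using Python's urllib.robotparser (example.com stands in for the real site):

import urllib.robotparser

# feed the disallow-all rules straight into the parser
rp = urllib.robotparser.RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /"])

# a compliant bot must not fetch the home page
print(rp.can_fetch("multicrawler", "http://example.com/"))  # prints False

A bot that still fetches the home page after reading this file simply isn't honouring the protocol it claims to honour.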

wilderness

5:34 pm on Jan 9, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Mokita,
"crawler" is a catch phrase, as previously mentioned.

The only significant search engine that I recall using it is AltaVista.

Don

Mokita

2:57 am on Jan 10, 2008 (gmt 0)

10+ Year Member



Thanks Don - I don't think I have encountered a UA containing "crawler" previously, so I saw no need to load up .htaccess with what seemed to be an unnecessary rule. It's already groaning under the weight of the many CIDRs I block ;)

Also, I prefer to block IPs rather than user agents where possible, as bots often change UA to try to look like browsers.
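
For illustration, an IP-level block in .htaccess looks something like this (Apache 2.2 mod_authz_host syntax; the /16 around the IP reported above is just an example range, not a recommendation):

# deny the whole 140.203.x.x range, allow everyone else
Order Allow,Deny
Allow from all
Deny from 140.203.0.0/16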

wilderness

3:11 am on Jan 10, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Also, I prefer to block IPs rather than user agents where possible, as bots often change UA to try to look like browsers.

Mokita,
You may use both or a combination of both effectively.

For example: a bot crawling from a valid IP range belonging to a widely used provider, where denying the whole range would affect too many innocent visitors. Then a UA block, or a UA combined with an IP condition, is more focused.
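
A sketch of that combined rule in mod_rewrite terms (the "crawler" string and the IP prefix from the report at the top of the thread are illustrations only):

RewriteEngine On
# refuse requests only when the UA contains "crawler" AND the IP matches the range
RewriteCond %{HTTP_USER_AGENT} crawler [NC]
RewriteCond %{REMOTE_ADDR} ^140\.203\.
RewriteRule .* - [F]

Visitors from the same range with a normal browser UA, and "crawler" UAs from other ranges, both pass through untouched.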

Don

incrediBILL

3:38 am on Jan 10, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



You guys are overly harsh on this one.

It's an actual real semantic web research project running in a university.

Don't you want the next Google killer to escape the labs? ;)

Mokita

3:54 am on Jan 10, 2008 (gmt 0)

10+ Year Member



Don't you want the next Google killer to escape the labs?

Not until it has grown up and learnt to obey robots.txt :)

Mokita

4:31 am on Jan 10, 2008 (gmt 0)

10+ Year Member



wilderness wrote:
You may use both or a combination of both effectively.

Thanks Don, I already do - but with the emphasis on blocking IPs "where possible", meaning I don't block ranges used by ISPs, unless they happen to come from Asia or Eastern Europe.

wilderness

5:03 am on Jan 10, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



You guys are overly harsh on this one.

It's an actual real semantic web research project running in a university.

Don't you want the next Google killer to escape the labs? ;)

According to kiki, they are the 2nd coming ;)

It's a difficult task, in morals and personal preferences, to hold "many third party" orgs to a delicate balance between what benefits webmasters (and their sites) and those third parties' tendency to leave their body excretions on the webmaster's doorstep ;)

thetrasher

11:56 am on Jan 10, 2008 (gmt 0)

10+ Year Member



multicrawler obeys my robots.txt.
User-agent: *
Disallow: /

I'm using CR+LF.

Mokita

12:06 pm on Jan 10, 2008 (gmt 0)

10+ Year Member



thetrasher wrote
I'm using CR+LF.

Umm, please translate! I don't know what "CR+LF" means - and even if I did, surely:

User-agent: *
Disallow: /

means exactly the same thing to all supposedly compliant bots?

thetrasher

12:29 pm on Jan 10, 2008 (gmt 0)

10+ Year Member



I'm using CR+LF as the newline character sequence in robots.txt. There shouldn't be a difference between a single LF (= 0Ah) and CR+LF (= 0Dh followed by 0Ah), but ...

volatilegx

6:34 am on Jan 12, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Umm, please translate! I don't know what "CR+LF" means - and even if I did, surely:

Different operating systems use different characters to signify a new line.

I believe it's:
CR (carriage return)... \r
LF (line feed)... \n
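
If you want to check which convention your robots.txt actually uses, a quick Python sketch (reading the file as raw bytes):

# look for the two-byte CR+LF sequence before falling back to bare LF
with open("robots.txt", "rb") as f:
    data = f.read()
if b"\r\n" in data:
    print("CR+LF (0Dh 0Ah) line endings")
elif b"\n" in data:
    print("LF-only (0Ah) line endings")
else:
    print("no newline found")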

JuUm

3:02 pm on Feb 6, 2008 (gmt 0)

10+ Year Member



Hi guys

I am one of the developers of MultiCrawler.

Could you please give me the exact URL our crawler was visiting, so that we can check what the error was?

Thanks

Juergen