Note: URL removed to comply with the T&Cs here; this is the explanation text from the site:
-----------------------------------------------------
MJ12conan is a specialised browser that is used to test visual content analysis technology that is under development by Majestic-12.
If you came to this page from a link in your log file, please be aware that MJ12conan is NOT a bot or crawler: it works no differently from a browser and requires manual input of a URL into the equivalent of a "location" field. Since it's just a browser, the robots.txt standard is not supported, as it does not apply to browsers that are driven by humans.
-----------------------------------------------------
I am posting this in the hope of confirming my assumption that I am right not to comply with robots.txt in this case. I know that some of you have very strong feelings about bots not complying with robots.txt (and I agree with you insofar as bots are concerned), but this tool is not a bot; it is a browser of a special kind. It can't and won't be used for crawling (analysed data will be discarded; it's mainly for visual validation of content analysis technology), and I expect the number of actual requests to be pretty close to 0, so there is a very good chance you will never see it.
It's still not too late to change the tool to support robots.txt, but I wanted to validate my theory that, since it's a browser rather than a bot, robots.txt does not apply to it.
If I understand you correctly, this specialized browser will be used by a human to check pages that were crawled by your bot.
Incorrect -- this specialised browser is used to look at pages chosen by a human (they have to be typed or pasted into the equivalent of a location bar), just as with Firefox and IE. It is totally unrelated to my bot (MJ12bot), which supports robots.txt and will continue to do so. The reason I picked a special user-agent is that I know you guys watch for visitors who don't request images, and I don't want anybody to draw incorrect conclusions about the nature of the requests made to your servers.
I'd like:
"Mozilla/5.0 (compatible; Conan/1.0.0 browser; http://example.com/conan.html)"
or something similar. It might save you some trouble with sites that require "Mozilla/" for browser sniffing. Also I'd suggest a distinct page (conan.html in the example above) telling webmasters/log-checkers just what you posted above. To do otherwise is to invoke the wrath of Crom! ;)
Edited -- Almost forgot the main point: If a human types or cut/pastes the URL into this Conan thing, then it's not a 'bot. Only automated user-agents can reasonably be required to request and honor robots.txt. As a matter of fact, I get suspicious when I see browsers (or alleged browsers) looking at robots.txt.
Jim
[edited by: jdMorgan at 1:34 am (utc) on Nov. 3, 2005]
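For anyone wanting to try the suggested string, here is a minimal Python sketch of sending it as the User-Agent header and of the kind of "Mozilla/" prefix test a sniffing site might apply. The helper names (build_request, passes_mozilla_sniff) are made up for illustration, the URL is the placeholder from the post above, and real browser-sniffing code on any given site may well work differently.

```python
from urllib.request import Request

# User-agent string suggested in the post above; example.com is a placeholder.
CONAN_UA = "Mozilla/5.0 (compatible; Conan/1.0.0 browser; http://example.com/conan.html)"


def build_request(url: str) -> Request:
    """Build (but do not send) a request that identifies itself with CONAN_UA."""
    return Request(url, headers={"User-Agent": CONAN_UA})


def passes_mozilla_sniff(user_agent: str) -> bool:
    """Hypothetical server-side check: accept only agents starting with 'Mozilla/'."""
    return user_agent.startswith("Mozilla/")


req = build_request("http://example.com/")
# urllib stores header names capitalised as "User-agent".
print(req.get_header("User-agent"))
print(passes_mozilla_sniff(CONAN_UA))         # the suggested string passes
print(passes_mozilla_sniff("MJ12conan/1.0"))  # a bare tool name would not
```

Nothing is fetched here; the request object is only built so the header can be inspected.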
"Mozilla/5.0 (compatible; Conan/1.0.0 browser; http://example.com/conan.html)"
Ummm, I suppose that since it's not a bot but a browser, it would be logical to use a Mozilla-like user-agent string - thanks for this idea :)
My point about the robots.txt bot/browser issue was the main one; thanks for sharing your opinion on that too!
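To illustrate the distinction discussed in this thread, here is a rough sketch of the check an automated agent such as MJ12bot would be expected to make before fetching a page, using Python's standard urllib.robotparser. The robots.txt content and the crawler_may_fetch helper are invented for the example; a human-driven browser would simply skip this step and request the typed URL directly.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt for illustration: MJ12bot is kept out of /private/,
# every other agent is allowed everywhere.
ROBOTS_TXT = """\
User-agent: MJ12bot
Disallow: /private/

User-agent: *
Disallow:
"""


def crawler_may_fetch(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse in memory, no network access
    return parser.can_fetch(agent, url)


# An automated crawler checks before every fetch.
print(crawler_may_fetch(ROBOTS_TXT, "MJ12bot", "http://example.com/private/page.html"))  # False
print(crawler_may_fetch(ROBOTS_TXT, "MJ12bot", "http://example.com/index.html"))         # True
```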