|Suggesting an improvement to Majestic's validation|
I just read the thread where Majestic is providing a validation key so webmasters can "prove" that a crawler claiming to be MJ12bot is the real thing.
A better system would be to "sign" each crawl request using a secret key combined with the URL being crawled. That way, even if one signature gets found out, it's useless for anything else.
I believe if you check how Facebook and Twitter handle their "connect" initiatives, you will find a good example of it.
Something signed w/ Majestic's private key can be verified by anyone holding the matching public key....
So if a signature gets found out, it's only good for that url, not sitewide.
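The per-URL signing idea above could be sketched roughly like this (a minimal sketch only -- the key, function names, and use of HMAC-SHA256 are my assumptions, not Majestic's actual scheme):

```python
import hmac
import hashlib

# Hypothetical shared secret issued to the crawler operator.
# (Assumption: Majestic's real scheme is not public; this just
# illustrates signatures that are bound to a single URL.)
SECRET_KEY = b"example-crawl-secret"

def sign_url(url: str) -> str:
    """Return a signature tied to this URL only."""
    return hmac.new(SECRET_KEY, url.encode(), hashlib.sha256).hexdigest()

def verify(url: str, signature: str) -> bool:
    """Constant-time check; a leaked signature is no good for other URLs."""
    return hmac.compare_digest(sign_url(url), signature)

sig = sign_url("http://example.com/page.html")
print(verify("http://example.com/page.html", sig))   # True
print(verify("http://example.com/other.html", sig))  # False
```

Since the URL is part of the signed input, a spoofer who captures one signature still can't forge requests for any other page.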
I can't speak to the validation scheme, sorry, but I can at least report this MJ12bot sighting circa four hours ago:
Mozilla/5.0 (compatible; MJ12bot/v1.3.3; http://www.majestic12.co.uk/bot.php?+)
MJ12bot v1.3.0 Implements Ground Breaking Validation Capability [webmasterworld.com]
MJ12bot - When you're a hammer, everything looks like a nail [webmasterworld.com]
On the subject of MJ12 "sighting reports," I'd like to offer the observation that at one point, I was getting hundreds of fake MJ12 requests per month, and zero real MJ12 requests. After implementing the MJ12 validation scheme on several servers, the number of fake MJ12 'bots continued to rise for a few months, and then began to slowly decline. Now, after about a year, I hardly ever see a fake MJ12 'bot attempting to gain access; the vast majority of requests with the MJ12 user-agent are legitimate.
So, for other distributed-bot owners out there: If you want to prevent spoofers from using your user-agent string and damaging your 'bot's reputation, implement a simple validation scheme like MJ12bot did. It works.
On the use of encrypted keys as proposed above: Using an encrypted key is all well and good, but it requires a script to do the checking on the server end -- not something that "Joe Average Webmaster" is going to be able to implement easily. Checking for a specific string in a specific HTTP header and/or appended to the user-agent string is sufficient to the task of discriminating a legitimate distributed 'bot from a spoofer.
And if there is any doubt that the 'secret passphrase' has been compromised, one can simply change it.
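As a rough sketch of that simpler approach (the "X-Bot-Key" header name and the passphrase here are made-up examples, not Majestic's actual values):

```python
# Site-specific passphrase; if it leaks, just change it.
# (Hypothetical example -- not Majestic's real validation scheme.)
PASSPHRASE = "example-passphrase"

def is_spoofed_mj12(user_agent: str, headers: dict) -> bool:
    """True if a request claims the MJ12bot user-agent but
    lacks the agreed passphrase -- i.e., it should be blocked."""
    if "MJ12bot" not in user_agent:
        return False  # not claiming to be MJ12bot; nothing to check
    return headers.get("X-Bot-Key") != PASSPHRASE

ua = "Mozilla/5.0 (compatible; MJ12bot/v1.3.3; http://www.majestic12.co.uk/bot.php?+)"
print(is_spoofed_mj12(ua, {}))                                   # True
print(is_spoofed_mj12(ua, {"X-Bot-Key": "example-passphrase"}))  # False
```

A plain string comparison like this is exactly the sort of check an average webmaster can drop into an existing script or server rule without touching any cryptography.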
There are perfect solutions and there are simple solutions. But only some of the simple methods actually solve the problem they're intended to address. The simplest solution that effectively solves a problem is termed "elegant." After noting the almost total disappearance of spoofed MJ12 requests, that's the term I'd use here.
After adding mj12 to robots.txt the spate of bots died down and finally ceased, at least on the web pages (I assume robots.txt is still hit occasionally).
I also now notice an absence of "fake" mj12 bots, which prompts me to wonder:
a) were they really fake or simply out of date and out of control; OR
b) did the drivers of the fakes simply lose interest; OR
c) were the bots themselves gradually updated by the fakers (who in that case were running genuine mj12's).
I doubt very much the mj12's were (eg) renamed nutch bots, or we would still be seeing them.
Credit to Majestic for obeying robots.txt. Nevertheless, I still will not permit distributed bots on my server. There is far too little control over them OR too much hassle setting up for them (eg mj12's validation).