Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

Suggesting an improvement to Majestic's validation

WebmasterWorld Senior Member 10+ Year Member

Msg#: 4221514 posted 9:25 am on Oct 25, 2010 (gmt 0)

I just read the thread where Majestic is providing a validation key for webmasters in its crawl to "prove" it's a real bot.

A better system would be to "sign" the crawl using secret keys and the URL crawled. That way, if a "secret" gets found out, it's unusable beyond the URL it was issued for.

I believe if you check how Facebook and Twitter handle their "connect" initiatives, you will find a good example of this.

Something signed with Majestic's private key can be verified by anyone holding Majestic's public key....

So if the key gets found out, it's only good for that URL, not sitewide.
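The per-URL signing idea could be sketched like this with an HMAC (a minimal symmetric-key sketch, assuming the crawler and webmaster share a secret; the secret value and function names are hypothetical, not Majestic's actual scheme):

```python
import hashlib
import hmac

SECRET = b"crawler-shared-secret"  # hypothetical shared secret

def sign_url(url: str) -> str:
    """Crawler side: derive a per-URL token from the secret and the URL."""
    return hmac.new(SECRET, url.encode("utf-8"), hashlib.sha256).hexdigest()

def verify(url: str, token: str) -> bool:
    """Webmaster side: recompute the signature and compare in constant time."""
    return hmac.compare_digest(sign_url(url), token)

token = sign_url("http://example.com/page.html")
print(verify("http://example.com/page.html", token))   # matches this URL
print(verify("http://example.com/other.html", token))  # token is URL-specific
```

Because each token is bound to one URL, a leaked token can't be replayed sitewide, which is the property argued for above.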



WebmasterWorld Senior Member 5+ Year Member

Msg#: 4221514 posted 3:23 pm on Nov 26, 2010 (gmt 0)

I can't speak to the validation scheme, sorry, but I can at least report this MJ12bot sighting circa four hours ago:

Mozilla/5.0 (compatible; MJ12bot/v1.3.3; http://www.majestic12.co.uk/bot.php?+)
robots.txt? Yes


MJ12bot v1.3.0 Implements Ground Breaking Validation Capability [webmasterworld.com]

MJ12bot - When you're a hammer, everything looks like a nail [webmasterworld.com]


WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member

Msg#: 4221514 posted 4:01 am on Jan 5, 2011 (gmt 0)

On the subject of MJ12 "sighting reports," I'd like to offer the observation that at one point, I was getting hundreds of fake MJ12 requests per month, and zero real MJ12 requests. After implementing the MJ12 validation scheme on several servers, the number of fake MJ12 'bots continued to rise for a few months, and then began to slowly decline. Now, after about a year, I hardly ever see a fake MJ12 'bot attempting to gain access; the vast majority of requests with the MJ12 user-agent are legitimate.

So, for other distributed-bot owners out there: If you want to prevent spoofers from using your user-agent string and damaging your 'bot's reputation, implement a simple validation scheme like MJ12bot did. It works.

On the use of encrypted keys as proposed above: Using an encrypted key is all well and good, but it requires a script to do the checking on the server end -- not something that "Joe average Webmaster" is going to be able to implement easily. Checking for a specific string in a specific HTTP header and/or appended to the user-agent string is sufficient to the task of discriminating a legitimate distributed 'bot from a spoofer.

And if there is any doubt that the 'secret passphrase' has been compromised, one can simply change it.
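The simple string check described above could look something like this (a minimal sketch; the passphrase, header name, and function name are hypothetical illustrations, not Majestic's actual scheme):

```python
PASSPHRASE = "s3cret-bot-token"  # hypothetical; simply change it if compromised

def allow_request(user_agent: str, headers: dict) -> bool:
    """Allow a request claiming to be MJ12bot only if it carries the
    agreed token, either appended to the UA string or in a custom header."""
    if "MJ12bot" not in user_agent:
        return True  # not claiming to be the bot; no check needed
    return (PASSPHRASE in user_agent
            or headers.get("X-Bot-Token") == PASSPHRASE)

# A spoofer copying only the public UA string fails the check:
print(allow_request("Mozilla/5.0 (compatible; MJ12bot/v1.3.3)", {}))
# The real bot, carrying the token, passes:
print(allow_request("Mozilla/5.0 (compatible; MJ12bot/v1.3.3) " + PASSPHRASE, {}))
```

The same comparison can be done in a rewrite rule or access-control config with no scripting at all, which is the point about "Joe average Webmaster" above.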

There are perfect solutions and there are simple solutions. But only some of the simple methods actually solve the problem they're intended to address. The simplest solution that effectively solves a problem is termed "elegant." After noting the almost total disappearance of spoofed MJ12 requests, that's the term I'd use here.



WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member

Msg#: 4221514 posted 10:19 pm on Jan 5, 2011 (gmt 0)

After adding mj12 to robots.txt, the spate of bots died down and finally ceased, at least on the web pages (I assume robots.txt is still hit occasionally).
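The robots.txt entry in question would take the standard form (a generic fragment, not the poster's exact file):

```
User-agent: MJ12bot
Disallow: /
```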

I also now notice an absence of "fake" mj12 bots which prompts me to wonder:

a) were they really fake or simply out of date and out of control; OR

b) did the drivers of the fakes simply lose interest; OR

c) were the bots themselves gradually updated by the fakers (who in that case were running genuine mj12's).

I doubt very much the mj12's were (e.g.) renamed nutch bots or we would still be seeing them.

Credit to Majestic for obeying robots.txt. Nevertheless, I still will not permit distributed bots on my server. There is far too little control over them OR too much hassle setting up for them (e.g. mj12's validation).
