Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

This 99 message thread spans 4 pages: 99 ( [1] 2 3 4 > >     
MJ12bot v1.3.0 Implements Ground Breaking Validation Capability
System
redhat



msg:3985375
 2:47 am on Jun 21, 2009 (gmt 0)

< split from [webmasterworld.com...] by incredibill - >

[edited by: encyclo at 11:25 am (utc) on Sep. 7, 2009]

 

Lord Majestic




msg:3981469
 7:13 pm on Aug 31, 2009 (gmt 0)

I've just sent a message to incrediBILL saying that we have started beta testing a new crawler identification mechanism.

The way it is designed to work (without getting into specifics that are not allowed in this forum):

1. You register for free on our site.

2. You verify your domains same way as in Google Webmaster Tools.

3. You supply a unique identification string that MJ12bot will send to your verified sites the next time they are crawled; the delay should be a matter of a few hours, not too long. The ident will be put in the User-Agent and/or a new header called Crawler-Ident. Perhaps other companies will choose to use this approach as well.

Note: this will be done by the new version of MJ12bot, v1.3.0 (or higher), which is in beta test now. Older versions (1.2.4 and 1.2.5) will still crawl for a few weeks.
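
For webmasters wondering what the check might look like on the receiving end, here is a minimal Python sketch. It assumes the ident arrives either in the Crawler-Ident header or somewhere inside the User-Agent string; the exact placement inside User-Agent is not specified in this thread, so the substring match is a guess. The function name classify_request and the sample values are made up for illustration.

```python
import hmac


def classify_request(headers, expected_ident):
    """Classify a request as "other", "genuine" or "fake".

    headers: dict of HTTP request headers (names matched case-insensitively).
    expected_ident: the string you registered for this site (hypothetical value).
    """
    if not expected_ident:
        raise ValueError("expected_ident must be a non-empty string")

    lowered = {name.lower(): value for name, value in headers.items()}
    user_agent = lowered.get("user-agent", "")

    # Requests that do not claim to be MJ12bot are outside the scope of this check.
    if "mj12bot" not in user_agent.lower():
        return "other"

    # Preferred path: the dedicated header, compared in constant time.
    header_ident = lowered.get("crawler-ident", "")
    if hmac.compare_digest(header_ident.encode("utf-8"),
                           expected_ident.encode("utf-8")):
        return "genuine"

    # Assumed fallback: the ident appears somewhere in the User-Agent string.
    if expected_ident in user_agent:
        return "genuine"

    return "fake"


# Illustrative request; the User-Agent shown here is shortened.
print(classify_request({"User-Agent": "MJ12bot/v1.3.0",
                        "Crawler-Ident": "abc123"}, "abc123"))
```

A request classified as "fake" would typically get a 403, while anything classified as "other" goes through your normal bot handling.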

The plan is to turn off old versions in 3-4 weeks time, and this I hope will do what incrediBILL suggested we can do - I don't think we can do more than that I am afraid!

[edited by: encyclo at 11:25 am (utc) on Sep. 7, 2009]

dstiles




msg:3982311
 10:24 pm on Sep 1, 2009 (gmt 0)

So if we don't sign up you won't crawl? Ok, I know what the answer is. You will still get a 403. I do not tolerate distributed bots. At all.

Lord Majestic




msg:3982819
 5:47 pm on Sep 2, 2009 (gmt 0)

So if we don't sign up you won't crawl?

Err? We will crawl all sites obeying robots.txt like we do normally.

If you want our bot to send your site unique text string however then you need to register and specify which string to which sites you wish our bot to send.

I do not tolerate distributed bots. At all.

That's fine - our bot will actually treat 403 on robots.txt request as a sign that we should not crawl this site. It's not a standard, but recommended by it :)
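
The robots.txt behaviour described here can be sketched from the crawler's side. This is only an illustration, not MJ12bot's actual code: the 403 rule comes from the post above, while the handling of every other status code is an assumption, and may_crawl_site is a made-up name.

```python
def may_crawl_site(robots_status):
    """Decide from the HTTP status of the robots.txt fetch whether to crawl at all.

    Only the 403 case is taken from the thread; the other branches are
    assumptions chosen to keep the sketch conservative.
    """
    if robots_status == 403:
        # Access to robots.txt is forbidden: treat the whole site as off limits.
        return False
    if 200 <= robots_status < 300:
        # robots.txt was fetched; its rules still have to be parsed and obeyed.
        return True
    if robots_status == 404:
        # Assumption: no robots.txt means no crawl restrictions.
        return True
    # Assumption: on anything else (for example 5xx), stay away for now.
    return False


for status in (200, 403, 404, 503):
    print(status, may_crawl_site(status))
```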

dstiles




msg:3983017
 9:13 pm on Sep 2, 2009 (gmt 0)

So yours is an opt-out not an opt-in system. Fair enough. I already knew that, as I said.

Life is far too short to add every bot to robots.txt in the vain hope (present company possibly excepted) that it will be obeyed. One day, after I retire, I may actually get around to a whitelist robots.txt rather than a blacklist.

Anyone who 403's robots.txt is asking for trouble from the "real" SEs. It's the pages that should be (and are) 403'd.

jdMorgan




msg:3983116
 2:48 am on Sep 3, 2009 (gmt 0)

Not quite sure what the argument is about here. The 'real' MJ12 'bot fetches and obeys robots.txt. If you want to use the "MJ12bot validation feature," then yes, that is opt-in via registration -- as it must be in order to guarantee that someone else doesn't 'request' a key for your site, thereby denying you the use of a key.

So the opt-in/registration applies to Webmasters who are willing to let MJ12 crawl as long as it is a legitimate/real MJ12 'bot, and not one of the spoofers that we've been discussing.

If you deny access to your pages via robots.txt only, then one of these spoofers may fetch them anyway. But the real MJ12 won't (discounting the possibility of occasional bugs). So you can 403 all 'bots claiming to be MJ12 because the only ones that will fetch your pages in violation of robots.txt will be the spoofers.

But if you allow MJ12 in robots.txt and you want to verify that requests are from the real MJ12, then you can register and specify a key that the real MJ12 will use when it requests resources from your site. It's a clever solution to the problem of distributed 'bots not being able to provide reverse-DNS because many of the instances of that 'bot will be on dynamic IPs, and coming from various ISPs.
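
For comparison, this is roughly the reverse-DNS check that works for crawlers running from fixed address ranges and fails for distributed ones. It is a sketch under assumptions: the host suffix .crawler.example is hypothetical, the resolvers are passed in as parameters so the logic can be tried without network access, and the default forward lookup handles IPv4 only.

```python
import socket


def forward_confirmed_rdns(ip, allowed_suffixes,
                           reverse=socket.gethostbyaddr,
                           forward=socket.gethostbyname_ex):
    """Return True if ip reverse-resolves to an allowed host name that
    resolves forward to the same ip again."""
    try:
        hostname = reverse(ip)[0]
    except OSError:
        return False  # no PTR record at all

    # A crawler node on a home connection resolves to its ISP's name, not the bot's.
    if not hostname.endswith(tuple(allowed_suffixes)):
        return False

    try:
        addresses = forward(hostname)[2]
    except OSError:
        return False

    # The forward lookup must confirm the PTR record, otherwise it could be forged.
    return ip in addresses


# Stand-in resolvers for a crawler node on a dynamic ISP address (made-up names).
def isp_reverse(ip):
    return ("dsl-1-2-3.isp.example", [], [ip])


def isp_forward(host):
    return (host, [], ["192.0.2.10"])


print(forward_confirmed_rdns("192.0.2.10", [".crawler.example"],
                             isp_reverse, isp_forward))
```

The node in the usage example is perfectly legitimate, yet the check rejects it, which is why a per-site key is needed instead.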

Again, I think it's a clever solution because it benefits both sides -- Webmasters by being able to validate legitimate MJ12 requests, and MJ12 itself, because the spoofers will be revealed for what they are, and perhaps stop damaging MJ12's reputation.

So, I think Lord Majestic and IncrediBill deserve some kudos here, and I hope that they get the credit for being the first to come up with a practical solution to an old, old problem. (Hint, gentlemen: Publish!)

Jim

Lord Majestic




msg:3983393
 3:03 pm on Sep 3, 2009 (gmt 0)

So yours is an opt-out not an opt-in system. Fair enough. I already knew that, as I said.

It is an opt-in system, actually: only sites that opt in receive a unique (per site) bot ID, which can easily be used to ban all fake MJ12bots.

We certainly obey robots.txt; the problem is that it is difficult for a site to keep track of whether a particular bot fetched robots.txt. With our solution you can easily ban all fake MJ12bots, because they won't transmit the secret phrase that you set yourself for your own sites.

We should start upgrading all our crawlers to this new version (v1.3.0) next week and will post about this solution on our bots page.

I think the solution is pretty decent - I certainly can't see what more can be done in our case since we can't, unfortunately, guarantee that reverse DNS will always point to specific domain :(

I am also considering allowing adding sites that will get secret key without verifying ownership - maybe exclude very large domains, this would make it easier to add domains to the list (I appreciate that adding file to root of each domain for verification can be troublesome if you have many domains).

We also can manually verify your IP address if you host a lot of domains on it, please feel free to contact me :)

I hope this solution proves that we are committed to doing all we can in order to be good netizen and make it harder for fakers to use our bot's name.

btw, I'd say the idea is pretty much 99% IncrediBILL's - I think he was suggesting something like this couple of years ago or so. :)

incrediBILL




msg:3983458
 4:58 pm on Sep 3, 2009 (gmt 0)

Congrats to Lord Majestic for this ground breaking work.

MJ12Bot is most likely the first distributed crawler to implement a way for webmasters to validate the crawler while still rebuffing all the fakes.

It'll be interesting to see how well this works!

carguy84




msg:3983481
 5:54 pm on Sep 3, 2009 (gmt 0)

Ok, I think I understand the verification, but is it to secured areas of your site or is it to validate the real bot?

And who is using this bot Google, Yahoo or Bing?

incrediBILL




msg:3983490
 6:01 pm on Sep 3, 2009 (gmt 0)

And who is using this bot Google, Yahoo or Bing?

The bot is MJ12Bot, used by Majestic 12 search engine [majestic12.co.uk] and it also powers the MajesticSEO tools.

While their search engine probably won't currently send you any significant traffic, their SEO tools give you a 100% unique look at the web with lots of detailed information, as opposed to the big 3 SEs.

is it to validate the real bot

Yes, to validate the real bot.

There are a lot of fakers of their bot causing problems and now webmasters can easily distinguish between the real bot and the fakers.

Lord Majestic




msg:3983492
 6:05 pm on Sep 3, 2009 (gmt 0)

Thanks should go to Bill for suggesting this idea (some years ago if I am not mistaken).

In retrospect maybe we should have built it in by design - with a distributed crawling model (which will only get more popular rather than less), it is impossible to have nicely defined IP subnet ranges or even reverse DNS capability, so something like Bill suggested was necessary :)

Note: our new crawler that supports it is in beta testing among our project members, so far so good - we plan to release it this weekend and in 2 weeks or less it should be the only legit MJ12bot crawling the Web :)

jdMorgan




msg:3983495
 6:11 pm on Sep 3, 2009 (gmt 0)

I am also considering allowing adding sites that will get secret key without verifying ownership - maybe exclude very large domains, this would make it easier to add domains to the list (I appreciate that adding file to root of each domain for verification can be troublesome if you have many domains).

Well, work through this idea carefully then, because there is some danger in it.

Let's say I want to make trouble for WebmasterWorld. I (a troublemaker) register for a bot-key at MJ12. Then the real WebmasterWorld Webmaster tries to register for one as well. What happens? If you accept the new registration without a WebmasterWorld-server-side validation, then the roles reverse, but the problem remains. As soon as the WebmasterWorld Webmaster gets the key installed, I (the troublemaker) simply go back and re-request a new key, so WebmasterWorld starts rejecting MJ12 crawls again.

Really, the only way to do it is to require a site to register (or to ask for a key) at MJ12 and then go check for an authentication file, HTTP header, or HTML meta-tag on the registering site -- just like G, Yahoo, MSN/Live/Bing, etc. do. After doing that, you can be sure you're handing the key to the site owner, and your new authentication key mechanism for distributed 'bots can be deployed on the site to make sure that the site is handing content to the real MJ12 'bot.
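
The danger and the fix can be sketched in a few lines of Python. This is an illustration only: KeyRegistry and the site_has_token callback are hypothetical names, and the callback stands in for whatever fetch of an authentication file, header or meta-tag the real service performs.

```python
class KeyRegistry:
    """Stores one crawl key per domain, but only for proven owners."""

    def __init__(self, site_has_token):
        # site_has_token(domain, token) -> bool is a hypothetical callback that
        # checks the registering site for the expected proof of ownership.
        self._site_has_token = site_has_token
        self._keys = {}

    def set_key(self, user, domain, verification_token, crawl_key):
        """Accept the crawl key only if the site itself confirms ownership."""
        if not self._site_has_token(domain, verification_token):
            return False  # a troublemaker cannot overwrite the owner's key
        self._keys[domain] = (user, crawl_key)
        return True

    def key_for(self, domain):
        """Return the crawl key the bot should send to this domain, if any."""
        entry = self._keys.get(domain)
        return entry[1] if entry else None


# Simulated state of the web: only the owner's token is present on the site.
site_files = {("example.com", "tok-owner")}
registry = KeyRegistry(lambda domain, token: (domain, token) in site_files)

print(registry.set_key("owner", "example.com", "tok-owner", "abc123"))
print(registry.set_key("troublemaker", "example.com", "tok-evil", "zzz999"))
print(registry.key_for("example.com"))
```

Without the ownership check, the second call would silently replace the owner's key, which is exactly the scenario described above.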

Jim

Lord Majestic




msg:3983511
 6:27 pm on Sep 3, 2009 (gmt 0)

Well, work through this idea carefully then, because there is some danger in it.

I agree, it's not something I'd jump into easily (or at all) - I do prefer existing verification process (this also includes check that MJ12bot can actually crawl that site in the first place). The extra bonus is that going through this verification will give you backlinks reports as well.

incrediBILL




msg:3983516
 6:33 pm on Sep 3, 2009 (gmt 0)

@jdMorgan - to properly do this you would need 2 keys, not just 1. The key installed on the website for MJ12bot to verify would need to be different than the key MJ12bot sends to the website for validation purposes.

Otherwise, the fake MJ12bots would simply extract the key from the meta tags and pretend to be MJ12bot, thus making the problem perpetual. If the verification key is stored as a file, it's not an issue because the fakes won't know which file, but the meta tag exposes it and all bets are off.

Lord Majestic




msg:3983519
 6:43 pm on Sep 3, 2009 (gmt 0)

The key installed on the website for MJ12bot to verify would need to be different than the key MJ12bot sends to the website for validation purposes.

That's right and this is how we do it - each verification file placed on any domain is unique to that domain AND the user who is verifying it. The key that we will send to the site when crawling it is user-defined (we just limit it to latin chars and digits).
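
A rough sketch of the two-key idea in Python follows. How the real verification file names are generated is not stated in this thread, so the HMAC derivation here is purely an assumption; verification_token, is_valid_ident and the server secret are illustrative, and "latin chars and digits" is read as ASCII letters and digits.

```python
import hashlib
import hmac
import re

# Assumed reading of "latin chars and digits": ASCII letters and digits only.
IDENT_PATTERN = re.compile(r"[A-Za-z0-9]+")


def verification_token(server_secret, user, domain):
    """Derive a token that is unique to the (user, domain) pair.

    This token proves site ownership and is never sent by the crawler,
    so exposing it in a file or meta-tag does not reveal the crawl ident.
    """
    message = f"{user}\n{domain.lower()}".encode("utf-8")
    return hmac.new(server_secret, message, hashlib.sha256).hexdigest()


def is_valid_ident(ident):
    """Check a user-defined crawl ident against the allowed character set."""
    return IDENT_PATTERN.fullmatch(ident) is not None


secret = b"hypothetical-server-secret"
print(verification_token(secret, "alice", "example.com")
      != verification_token(secret, "bob", "example.com"))
print(is_valid_ident("abc123"), is_valid_ident("abc-123"))
```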

We also added ability to specify Crawl-Delay for verified sites, so you can slow down our bot to up to 20 seconds between requests as well :)

This reminded me to add META-tag verification as well...

Demaestro




msg:3983611
 8:57 pm on Sep 3, 2009 (gmt 0)

Lord M

Congrats on implementing something useful.

The more I think about it the more I think:

"What a great idea"

keyplyr




msg:3983655
 9:55 pm on Sep 3, 2009 (gmt 0)

1. You register for free on our site.

Sorry, I'm tired of "registering" for stuff. I've always allowed your bot since you're a member here and have been up front and transparent, but not going to register.

mack




msg:3983670
 10:40 pm on Sep 3, 2009 (gmt 0)

Lord Majestic, this is an excellent idea and thank you for sharing with us. I appreciate what you are trying to do, and the concept is sound. I expect this is something the majors may get involved with. I hope so because this will greatly hinder people trying to run fake bots.

You have my support on this because you are one of the small players launching a useful initiative, as opposed to a major rolling something out and assuming a new standard.

All the best.

Mack.

Receptional




msg:3983671
 10:41 pm on Sep 3, 2009 (gmt 0)

If you don't register, then you are fine keyplyr. Nothing changes for you :)

Let's be clear - this implementation was requested by VERY tech savvy webmasters. It makes no difference to the average webmaster. If you don't frequent the spider identification forum [webmasterworld.com], then this enhancement won't affect you. It is purely there to stop black hats.

I expect only a few hundred webmasters in the whole world will use this - it is highly technical, but it is seriously stronger than Google's policy, which requires us all to second guess Google's IP numbers to be able to stop scrapers pretending to be Googlebot.

Disclaimer: I'm biased.

But kudos to IncrediBill for suggesting it months ago and to Alex (Majestic) for following through. I know that Alex has a heck of a lot more on his plate at the moment, so taking the time to get this implemented is no small feat.

incrediBILL




msg:3983675
 10:59 pm on Sep 3, 2009 (gmt 0)

@keyplyr "but not going to register" - then you let the fakes win, not good.

For someone that fights the good fight to keep the bad bots at bay I'm perplexed at your reaction to a simple registration to stop the bad bots giving MJ12Bot a bad name.

Besides, MajesticSEO gives you more information than the major SEs about your site by registering so you actually get a lot more than the minimal effort required to register in order to block the fake bots.

Lord Majestic




msg:3983686
 11:14 pm on Sep 3, 2009 (gmt 0)

Hi guys,

Thanks for kind words and support, I am glad that we put time into implementing this idea :)

@keyplyr: as Receptional said you don't need to register (we only ask for email, name and password as required fields, very hassle free). It's just if you do want unique identification string sent to your own sites then you have to register and verify sites as otherwise we won't be able to do it I am afraid, this is the minimally necessary data that we need in order for this system to work.

Just like webmasters, we've been a victim of fake bots that were overloading sites and disobeying robots.txt whilst claiming to be MJ12bot - this was a very difficult period for us (and for the webmasters who got hit!). This motivated us to create this mechanism.

The credit for the system should go to incrediBILL however - it was he who suggested it at least 2 years ago I think, and again a few months ago - I am just happy that it is well received by the community :)

IanCP




msg:3983693
 11:32 pm on Sep 3, 2009 (gmt 0)

OK, I'll bite. I'm looking at the Majestic12 home page now and see nothing relating to registration.

Rosalind




msg:3983696
 11:44 pm on Sep 3, 2009 (gmt 0)

OK, I'll bite. I'm looking at the Majestic12 home page now and see nothing relating to registration.

Nor me, and I had a good dig around the website. I imagine it's not live yet?

I'd like to suggest posting sample code that people can use to implement this, on the Majestic12 site. If this is worth doing, perhaps it's also worth making it accessible for less tech-savvy webmasters to use?

Lord Majestic




msg:3983704
 11:57 pm on Sep 3, 2009 (gmt 0)

Right, it's not on the Majestic-12 site since we don't have domain verification functionality there.

I assume this is allowed by the mods, so will post details:

Step 1: Register (free) - https://www.majesticseo.com/register.php or login if already registered: https://www.majesticseo.com/login.php

Step 2: Verify your domains (also free) by adding them to Control Panel, same way as done in Google Webmaster Tools

Step 3: Setup crawler settings: https://www.majesticseo.com/crawlersettings.php

You can set the following:
1) crawl delay for verified sites (this will override robots.txt)
2) unique crawler identification that will be sent only to verified sites for that user: you can choose how it is sent, either added to User-Agent header and/or in new HTTP header called: Crawler-Ident
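
To make the two settings concrete, here is a small Python sketch of how a crawler might apply them. It is not MJ12bot's code: SiteSettings, build_headers and effective_delay are invented names, the shortened User-Agent and the way the ident is appended to it are assumptions, and the 20 second cap is the limit mentioned earlier in this thread.

```python
from dataclasses import dataclass
from typing import Optional

MAX_CRAWL_DELAY = 20.0             # seconds; upper limit mentioned in the thread
BASE_USER_AGENT = "MJ12bot/v1.3.0"  # shortened, illustrative User-Agent


@dataclass
class SiteSettings:
    """Per-site crawler settings chosen by the verified site owner."""
    ident: str = ""
    send_in_user_agent: bool = False
    send_in_header: bool = True
    crawl_delay: Optional[float] = None


def build_headers(settings):
    """Build the request headers the crawler would send to this site."""
    user_agent = BASE_USER_AGENT
    if settings.ident and settings.send_in_user_agent:
        # Assumption: the ident is simply appended to the User-Agent string.
        user_agent = f"{user_agent} {settings.ident}"
    headers = {"User-Agent": user_agent}
    if settings.ident and settings.send_in_header:
        headers["Crawler-Ident"] = settings.ident
    return headers


def effective_delay(settings, robots_delay=None):
    """The owner's setting overrides robots.txt and is capped at the maximum."""
    if settings.crawl_delay is not None:
        return min(max(settings.crawl_delay, 0.0), MAX_CRAWL_DELAY)
    return robots_delay if robots_delay is not None else 0.0


site = SiteSettings(ident="abc123", send_in_user_agent=True, crawl_delay=45)
print(build_headers(site))
print(effective_delay(site, robots_delay=5))
```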

These changes will be uploaded to our project site, from which crawlers fetch data regularly (every hour or two), so updates should be pretty quick. I still need to change our crawler to take them from the new location. This is not done yet, but it already takes such sites from another place, so it won't take long. However, we first need to go through the beta period with the new crawler, so I reckon we won't have ALL crawlers updated for a couple of weeks.

Only MJ12bots with version v1.3.x (or higher) will support this ident.

------

We'll link to this from our bot's homepage.

IanCP




msg:3983746
 12:43 am on Sep 4, 2009 (gmt 0)

Worked so far. 50,000 backlinks?

Lord Majestic




msg:3983757
 12:53 am on Sep 4, 2009 (gmt 0)

Worked so far. 50,000 backlinks?

You can retrieve them all! ;)

keyplyr




msg:3983768
 1:27 am on Sep 4, 2009 (gmt 0)

@keyplyr "but not going to register" - then you let the fakes win, not good.

For someone that fights the good fight to keep the bad bots at bay I'm perplexed at your reaction to a simple registration to stop the bad bots giving MJ12Bot a bad name.

Besides, MajesticSEO gives you more information than the major SEs about your site by registering so you actually get a lot more than the minimal effort required to register in order to block the fake bots.


Granted, but I've never seen any spoofs of MJ12bot and I think all this is overkill (but don't wish to devalue all the work it took to appease those who requested this implementation.)

incrediBILL




msg:3983774
 1:38 am on Sep 4, 2009 (gmt 0)

Granted, but I've never seen any spoofs of MJ12bot and I think all this is overkill.

How could you tell?

It's not like Googlebot where it had its own IP range so you knew the difference.

Besides, if you do a good job of whitelisting like I do the only way to get into my server would be to probe the robots.txt file script to find out which bots I allow and then spoof the weakest which would be the distributed crawlers like MJ12bot.

Effective immediately, that security hole has been closed and there's a template for other distributed crawlers to follow in the future.

keyplyr




msg:3983788
 2:05 am on Sep 4, 2009 (gmt 0)

How could you tell?

Well, because I watch my logs diligently and Alex has posted for years here and on his own bot info pages about spoofing his UA.

Be careful when comparing others' style of security to yours. I have my reasons for the way I do things, which is a mix of whitelisting and UA/IP/referer blocking.

Lord Majestic




msg:3983790
 2:10 am on Sep 4, 2009 (gmt 0)

Fair enough keyplyr - I am glad you did not get hit by fake MJ12bots, those who did however now have a choice - I can certainly sleep well now knowing that we offered this choice and did all we could! Good night! :)

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
Terms of Service ¦ Privacy Policy ¦ Report Problem ¦ About
© Webmaster World 1996-2014 all rights reserved