MJ12bot v1.3.0 Implements Ground Breaking Validation Capability

System

2:47 am on Jun 21, 2009 (gmt 0)

redhat

< split from [webmasterworld.com...] by incredibill - >

[edited by: encyclo at 11:25 am (utc) on Sep. 7, 2009]

jdMorgan

8:27 pm on Oct 4, 2009 (gmt 0)

That'll work for me -- It eliminates two filter 'exceptions' that I had to make for MJ12, so that's good.

Thanks!
Jim

jdMorgan

8:30 pm on Oct 11, 2009 (gmt 0)

Kind of disappointed here... Aside from the 'test run' mentioned by Lord Majestic above, I've had no requests from MJ12bot/v1.3.1 or higher since then. It looks like the MJ12bot/v1.2.x requests are tapering off, but all I've seen today are MJ12bot/v1.3.0 requests without the crawler-ident feature.

Based on the 'timing' observed so far, I suspect it may be another week before MJ12 participants upgrade their 'nodes' and the newest MJ12 version starts showing up in my logs.

Other than the 'test run' already cited, is anybody else seeing MJ12bot/v1.3.1 (or higher) requests with the crawler-ident in them yet?

Jim

Lord Majestic

9:17 pm on Oct 11, 2009 (gmt 0)

jdMorgan - can you please send me details of those requests that were from 1.3.0 but without the ident? We had a couple of issues reported, so we are testing fixes for them before releasing a new version that addresses both problems.

jdMorgan

9:39 pm on Oct 11, 2009 (gmt 0)

I sent you some log file data via stickymail.
Jim

keyplyr

11:13 pm on Oct 11, 2009 (gmt 0)

jdMorgan - Other than the 'test run' already cited, is anybody else seeing MJ12bot/v1.3.1 (or higher) requests with the crawler-ident in them yet?

Yes, I've been seeing the new crawler w/ ident for a couple of weeks now. But I'm still seeing older builds (no ident, which receive a 403).

jdMorgan

11:44 pm on Oct 11, 2009 (gmt 0)

OK, thanks 'key... Maybe all of my URLs are in the "slow bucket." :)

Jim

Lord Majestic

8:02 pm on Oct 13, 2009 (gmt 0)

Ran tests on all sites that had registered for idents; this now uses a new, more standards-compatible user-agent:

Mozilla/5.0 (compatible; MJ12bot/v1.3.1; [majestic12.co.uk...] ident=SECRETIDENT)

We do seem to get 403s from what I believe is jdMorgan's website, will wait to investigate it with him before releasing new crawler - better test it right before release...

Ocean10000

9:53 pm on Oct 18, 2009 (gmt 0)

Below is a snippet of code for an ASP.NET 2.0 (or later) website. It should properly identify the newer version of MJ12bot and gather the identification information from the User-Agent. I have not seen the bot in the wild yet, so this is built from information posted to this thread.

jdMorgan

10:45 pm on Oct 18, 2009 (gmt 0)

You may also want to check the MJ12-proprietary Crawler-Ident HTTP header; Selecting that identification method in MJ12's control panel is an inherently more-secure option, as the 'secret' Crawler-Ident value can't normally get 'accidentally' published in standard access log or 'stats' files. It's also a better option for organizations where the server log files and stats are accessible to several or many persons.

Consistent with the example ident in the code above, the format of the new Crawler-Ident header is:

 MJ12bot/v1.3.1; ident:SECRETIDENT
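For anyone checking this outside ASP.NET, here is a rough Python sketch of pulling the ident out of either location. It assumes the two formats quoted in this thread (`ident=SECRETIDENT` inside the User-Agent and `ident:SECRETIDENT` in the Crawler-Ident header); the function name and the regular expression are illustrative only, not anything published by MJ12.

```python
import re

# Matches "ident=VALUE" (User-Agent form) or "ident:VALUE" (Crawler-Ident
# header form), as quoted in this thread. Assumption: VALUE contains no
# whitespace, semicolons, or parentheses.
_IDENT_RE = re.compile(r"\bident[=:]\s*([^\s;()]+)")


def extract_ident(user_agent, crawler_ident_header=None):
    """Return the ident found in the header or the User-Agent, else None."""
    # Look at the dedicated header first, since it does not end up in
    # standard access logs; fall back to the User-Agent string.
    for value in (crawler_ident_header, user_agent):
        if value:
            match = _IDENT_RE.search(value)
            if match:
                return match.group(1)
    return None
```

The returned value would then be compared against the ident you registered; a request that names MJ12bot but yields `None` is one of the unvalidated ones discussed here.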

Jim

keyplyr

1:13 am on Oct 19, 2009 (gmt 0)

@Jim

Agreed, however an increasing number of webmasters who will be opting in for MJ12bot's SECRETIDENT feature manage sites on shared hosting (as I do) where header info is not available.

Ocean10000

1:43 am on Oct 19, 2009 (gmt 0)

Updated to check for the header as well if the ident is not included in the User-Agent. I am awaiting actual examples of crawls by real bots to verify that this is working as expected.

example C# code in
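// Checks whether the request claims to be MJ12bot.
// (Request.Browser values like these presumably come from a custom browser definition, not shown here.)
if (Request.Browser.Browser == "MJ12bot")
{
    // Checks whether the ident returned is the one you are expecting.
    if (Request.Browser["ident"] != "yoursecretident")
    {
        // Show the bot the door.
        Response.StatusCode = 403;
        // StatusDescription is the reason phrase only; the code is set above.
        Response.StatusDescription = "Forbidden";
        Response.SuppressContent = true;
        Response.End();
    }
}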

Lord Majestic

9:26 pm on Oct 24, 2009 (gmt 0)

I am sorry for the delay guys - the new crawler has now been released, it is version 1.3.1 (don't assume it won't increment though!).

The delay was due to a few issues that needed to be tested.

In one week we'll turn off old crawlers - not long to wait.

If you come across any issues with the new functionality, please don't hesitate to email me directly.

Thank you

Alex

Pfui

3:25 am on Nov 9, 2009 (gmt 0)

I'm unhappy to report that the first visit I've seen from any 'newer' bot (v1.3.1) neither came from a properly configured machine nor requested robots.txt. Here's the complete ELF access_log entry:

. - - [08/Nov/2009:17:08:54 -0800] "GET / HTTP/1.0" 403 1437 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.3.1; http://www.majestic12.co.uk/bot.php?+)"

The bot was initially 403'd due to the 'dot' Host/IP (although had the Host/IP passed muster, it would've been 403'd because of no robots.txt).

I guess the UA could've been spoofed as it lacks the planned-for "ident=SECRETIDENT" part of the string. Thing is, the vast majority of fauxbot-runners I see use routine, even old, versions of bots. Why bother recoding to spoof something this new this soon?

Pfui

3:40 am on Nov 9, 2009 (gmt 0)

And FWIW, just about two hours later, 'visitor' #2 -- also faulty:

webn.tracking202.com
Mozilla/5.0 (compatible; MJ12bot/v1.3.1; http://www.majestic12.co.uk/bot.php?+)

robots.txt? NO

(I won't keep posting these unless the hits change for the better, or worse.)

jdMorgan

1:50 am on Nov 10, 2009 (gmt 0)

I haven't seen a valid v1.3.0 or v1.3.1 MJ12bot yet, except for a "manual test run" done by Lord Majestic himself. So far, they're all v1.2.5, or what I assume to be v1.3.x spoofers which don't send the crawler-ident in either the user-agent or the HTTP header. So I'm assuming that the roll-out of the new version has been delayed or is going much more slowly than predicted -- or perhaps all the MJ12bots that have ever visited my sites have all been spoofers. :(

Jim

Lord Majestic

10:44 pm on Nov 10, 2009 (gmt 0)

Hi,

We've rolled out v1.3.1 and we will be switching off older versions in a week. I am going to go through another test to confirm that 1.3.x definitely works though - bugs are always possible :(

jdMorgan

1:54 pm on Nov 17, 2009 (gmt 0)

I'm eagerly awaiting my first v1.3.x request with a valid crawler-ident... :)

Jim

Pfui

2:07 am on Dec 17, 2009 (gmt 0)

Still no news. Still no-robots hits. And now, a no-Host hit, too:

.
Mozilla/5.0 (compatible; MJ12bot/v1.3.1; http://www.majestic12.co.uk/bot.php?+)

robots.txt? NO

(That lone dot is all that appeared in my access_log as the Host.)

jdMorgan

3:14 am on Dec 17, 2009 (gmt 0)

I'm seeing at least a 50-to-1 ratio of MJ12 spoofers to real requests. The scale of the spoofing problem that faced MJ12 has only become obvious with the release of 1.3.1 and its validation capability, and the number of fake MJ12 requests has surprised (OK, I should say stunned) both me and LordMajestic.

The real MJ12 does fetch and obey robots.txt, but unless you've set your crawler-ident string and are testing for it, you'll never know the real MJ12 from the fakers. And it's now quite clear that there are a *lot* more fakers...

You might want to have a word with your host about that "lone dot" problem; Your server should fall back to giving you the IP address if there is no rDNS available for a requesting IP and it can't give you the hostname.

Jim

Pfui

4:49 am on Dec 17, 2009 (gmt 0)

Actually, I prefer to simply block any Host/IP without a dot, and any Host/IP that's only a dot. (In fact, you kindly provided the rewrite code for the former way back when, tyvm:)
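The "no dot, or only a dot" test described above is simple enough to sketch in Python for anyone filtering logs after the fact rather than in the server. The function name is made up for illustration, and it assumes the logged value is a host name or an IPv4 address; an IPv6 literal contains no dots and would be flagged too.

```python
def is_suspicious_remote_host(remote_host):
    """Return True if the logged remote host has no dot or is only dots."""
    host = (remote_host or "").strip()
    # No dot at all (e.g. "localhost" or an empty field), or nothing left
    # once the dots are removed (e.g. the lone "." seen in this thread).
    return "." not in host or host.strip(".") == ""
```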

Lord Majestic

12:31 pm on Dec 17, 2009 (gmt 0)

This spoofing is very worrying - we don't get many reports of it; in fact, I was very surprised it was the case - perhaps they go for specific sites of interest?

dstiles

10:55 pm on Dec 17, 2009 (gmt 0)

I got the one below today. No idea if it's genuine or not as I haven't signed up. I block all distributed bots, as I've said before.

UA: Mozilla/5.0 (compatible; MJ12bot/v1.3.2; [majestic12.co.uk...]
IP: 84.168.193.nnn (Deutsche Telekom AG)
Headers: Accept only
No referer

As to why spoofing: if you can't determine whether or not the bot is genuine then it's a field day, and I doubt very much if many web site owners know anything about signing up for it - or even about trapping bots. Look in the logs, hey, the bot's got an info page, must be genuine! "Allow it in robots.txt, which I've just about mastered." Bingo.

keyplyr

12:33 am on Dec 18, 2009 (gmt 0)

With the overwhelming number of questionable MJ12bots, IMO many are authentic, but haven't been updated. I don't have any evidence of this, but there are just too many. Out of 50 MJ12bot hits, maybe 2 will be 1.3x and validate w/ the secret ID.
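To put numbers on ratios like this, a small Python sketch along the following lines could tally MJ12bot hits by version and by whether the User-Agent carries your ident. It assumes combined-format access logs where the User-Agent is the last quoted field, and it only works for the User-Agent ident method, since the Crawler-Ident header does not normally appear in access logs. The function name is hypothetical.

```python
import re
from collections import Counter

# In combined log format the User-Agent is the last double-quoted field.
_UA_RE = re.compile(r'"([^"]*)"\s*$')
_VERSION_RE = re.compile(r"MJ12bot/v([\d.]+)")


def tally_mj12_hits(log_lines, secret):
    """Count MJ12bot hits keyed by (version, user_agent_contains_secret)."""
    counts = Counter()
    for line in log_lines:
        ua_match = _UA_RE.search(line)
        if not ua_match:
            continue  # line has no trailing quoted field; skip it
        user_agent = ua_match.group(1)
        version = _VERSION_RE.search(user_agent)
        if version:
            counts[(version.group(1), secret in user_agent)] += 1
    return counts
```

Calling `tally_mj12_hits(open("access_log"), "YOUR-SECRET-ID")` (the file name is a placeholder) gives a `Counter` whose keys separate validated from unvalidated hits per version.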

KenB

2:53 am on Dec 18, 2009 (gmt 0)

I love the bot validation key idea. I've registered and am awaiting my email confirmation.

One question I have is what would be the .htaccess entry we could use to block all fake MJ12bot hits while allowing the real hit with the proper key?

keyplyr

3:44 am on Dec 18, 2009 (gmt 0)

@KenB

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} MJ12bot
RewriteCond %{HTTP_USER_AGENT} !YOUR-SECRET-ID
RewriteRule !^robots\.txt$ - [F]

Change YOUR-SECRET-ID and allow any custom 403 page if applicable.

jdMorgan

4:26 am on Dec 18, 2009 (gmt 0)

MJ12 also supports sending a custom HTTP request header containing the "YOUR-SECRET-ID". This method is somewhat safer since there is absolutely no risk of your secret ID getting published accidentally in log or stats files -- or even being posted in a forum accidentally due to webmaster errors.

It's also much safer if you're in an organization or business arrangement where multiple people have log file access, but not all are supposed to know the value of "YOUR-SECRET-ID".

In order to use the HTTP header method, select that option in the MJ12 setup form, and then use a modified version of the code above:


RewriteEngine On
#
RewriteCond %{HTTP_USER_AGENT} MJ12bot
RewriteCond %{HTTP:Crawler-Ident} !=YOUR-SECRET-ID
RewriteRule !^robots\.txt$ - [F]

Actually, MJ12 will allow you to select both methods. There's some merit in that for initial testing, but generally no advantage to using both. Use the Crawler-Ident method if you can, otherwise use the User-agent-string method.
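For those who cannot use mod_rewrite, the logic of the two rule sets above can be sketched in Python as follows. This is only an approximation under stated assumptions: the function and parameter names are invented for illustration, the User-Agent method is a substring test and the header method an exact comparison (mirroring the rules as posted), and robots.txt is always let through.

```python
import hmac


def is_forbidden(path, user_agent, secret, crawler_ident=None, use_header=False):
    """Return True when a request naming MJ12bot lacks the expected ident."""
    if "MJ12bot" not in user_agent:
        return False  # not claiming to be MJ12bot; leave it to other rules
    if path.lstrip("/") == "robots.txt":
        return False  # robots.txt stays fetchable, as in the rules above
    if use_header:
        # Header method: compare the whole Crawler-Ident value with the
        # secret. compare_digest avoids leaking match length via timing.
        supplied = (crawler_ident or "").encode()
        return not hmac.compare_digest(supplied, secret.encode())
    # User-Agent method: the secret only has to appear somewhere in the UA.
    return secret not in user_agent
```

A caller would return a 403 whenever this yields `True`, passing `use_header=True` only if the header option was selected in the MJ12 setup form.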

Jim

KenB

8:23 pm on Dec 18, 2009 (gmt 0)

Thanks!

I've implemented jdMorgan's solution. I like the little extra security it provides. Now if we could just get the other legit bots to implement the same concept to help us stop spoofing bots.

I took a look at the stats provided to us once we log in and they are pretty impressive.

Pfui

9:51 pm on Dec 18, 2009 (gmt 0)

At this time, I think distributed bots could consider implementing ID schemes (yes, that means you, 80legs [webmasterworld.com]). At least most majors (& some minors) can be confirmed/denied depending on their Host/IP (via rDNS/rIP).

How-to info/links here (old but okay): search robots in disguise
(scroll to) But what about crawlers that aren't so well-behaved?
[bing.com...]
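The rDNS/rIP confirmation mentioned above is usually done as a forward-confirmed reverse DNS lookup: resolve the IP to a host name, check the name belongs to the crawler's domain, then resolve that name back and make sure the original IP is among the results. A minimal Python sketch follows, assuming IPv4 and using only the standard `socket` module; the function name and the injectable resolver parameters are illustrative, and the domain suffixes you pass in are your own assumption about each crawler.

```python
import socket


def forward_confirmed_rdns(ip, allowed_suffixes,
                           reverse=socket.gethostbyaddr,
                           forward=socket.gethostbyname_ex):
    """Return the verified host name for ip, or None if any check fails."""
    try:
        host = reverse(ip)[0].rstrip(".").lower()
    except OSError:
        return None  # no reverse DNS record for this address
    # The name must be the crawler's domain itself or a subdomain of it.
    if not any(host == s or host.endswith("." + s) for s in allowed_suffixes):
        return None
    try:
        addresses = forward(host)[2]
    except OSError:
        return None  # the claimed host name does not resolve
    # Only trust the name if it maps back to the address that connected.
    return host if ip in addresses else None
```

This does not help with distributed crawlers, which is the point being made here: their nodes run on arbitrary hosts, so there is no domain suffix to check against.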

Pfui

5:57 pm on Jan 2, 2010 (gmt 0)

Well, here's a server farm keeping up with versions. Can't say as I've even seen that with a bot-runner using a faked UA. (Note: The same single letter is obfuscated in the Host name.)

JANUARY, 2010

te02.te*hentrance.*om
Mozilla/5.0 (compatible; MJ12bot/v1.3.2; http://www.majestic12.co.uk/bot.php?+)
robots.txt? NO

JUNE, 2009 (thread [webmasterworld.com])

te02.te*hentrance.*om
Mozilla/5.0 (compatible; MJ12bot/v1.2.5; http://www.majestic12.co.uk/bot.php?+)
robots.txt? NO

te02.te*hentrance.*om
Mozilla/5.0 (compatible; MJ12bot/v1.2.4; http://www.majestic12.co.uk/bot.php?+)
robots.txt? NO

Ironically, June, 2009, was when this thread announced MJ12bot's 'ground breaking validation capability' -- an ident scheme I've not seen requested. Oh, and about robots.txt --

Since v1.2.1, I've not seen any MJ12bot request robots.txt. Numerous versions, scores of Hosts and hits... None. Zero. Nada. Zip.

KenB

6:22 pm on Jan 2, 2010 (gmt 0)

Anyone want to confirm whether Pfui's host name belongs to the legitimate MJ12bot (change the asterisk to a 'c')?
