
Forum Moderators: Ocean10000


MJ12bot v1.3.0 Implements Ground Breaking Validation Capability

     

System

2:47 am on Jun 21, 2009 (gmt 0)

redhat

 
 


< split from [webmasterworld.com...] by incredibill - >

[edited by: encyclo at 11:25 am (utc) on Sep. 7, 2009]

8:27 pm on Oct 4, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Mar 31, 2002
posts:25430
votes: 0


That'll work for me -- It eliminates two filter 'exceptions' that I had to make for MJ12, so that's good.

Thanks!
Jim

8:30 pm on Oct 11, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Mar 31, 2002
posts:25430
votes: 0


Kind of disappointed here... Aside from the 'test run' mentioned by Lord Majestic above, I've had no requests from MJ12bot/v1.3.1 or higher since then. It looks like the MJ12bot/v1.2.x requests are tapering off, but all I've seen today are MJ12bot/v1.3.0 requests without the crawler-ident feature.

Based on the 'timing' observed so far, I suspect it may be another week before MJ12 participants upgrade their 'nodes' and the newest MJ12 version starts showing up in my logs.

Other than the 'test run' already cited, is anybody else seeing MJ12bot/v1.3.1 (or higher) requests with the crawler-ident in them yet?

Jim

9:17 pm on Oct 11, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Aug 8, 2004
posts:1679
votes: 0


jdMorgan - can you please send me details on those requests that were from 1.3.0 but without ident? We had a couple of issues reported, so we are testing fixes in order to release a new version that addresses both problems.
9:39 pm on Oct 11, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Mar 31, 2002
posts:25430
votes: 0


I sent you some log file data via stickymail.
Jim
11:13 pm on Oct 11, 2009 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:12913
votes: 893


jdMorgan - Other than the 'test run' already cited, is anybody else seeing MJ12bot/v1.3.1 (or higher) requests with the crawler-ident in them yet?

Yes, I've been seeing the new crawler w/ ident for a couple of weeks now. But I'm still seeing older builds (no ident, which receive a 403).
11:44 pm on Oct 11, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Mar 31, 2002
posts:25430
votes: 0


OK, thanks 'key... Maybe all of my URLs are in the "slow bucket." :)

Jim

8:02 pm on Oct 13, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Aug 8, 2004
posts:1679
votes: 0


We ran tests on all sites that had registered for idents. The crawler now uses a new, more standards-compatible user-agent:

Mozilla/5.0 (compatible; MJ12bot/v1.3.1; [majestic12.co.uk...] ident=SECRETIDENT)

We do seem to get 403s from what I believe is jdMorgan's website, so we will wait to investigate it with him before releasing the new crawler - better to test it right before release...

9:53 pm on Oct 18, 2009 (gmt 0)

Administrator

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month

joined:Jan 14, 2004
posts:864
votes: 3


Below is a snippet of code for ASP.NET 2.0 and above websites. It should properly identify the newer version of MJ12bot and gather the identification information from the User-Agent. I have not seen the bot in the wild yet, so this is built from information posted to this thread.



<browsers>
  <!--10-18-09 -->
  <!--http://www.webmasterworld.com/search_engine_spiders/3983454-7-10.htm -->
  <browser id="MJ12botWebmasterWorld" parentID="mozilla">
    <sampleHeaders>
      <header name="User-Agent" value="Mozilla/5.0 (compatible; MJ12bot/v1.3.1; [majestic12.co.uk...] ident=SECRETIDENT)" />
    </sampleHeaders>

    <identification>
      <userAgent match="MJ12bot/v(?'version'(?'major'\d+)(?'minor'\.\d+))" />
    </identification>
    <capture>
      <userAgent match=";\sident=(?'Ident'\w*)\)" />
    </capture>
    <capabilities>
      <capability name="crawler" value="true" />
      <capability name="browser" value="MJ12bot" />
      <capability name="majorversion" value="${major}" />
      <capability name="minorversion" value="${minor}" />
      <capability name="version" value="${version}" />
      <capability name="ident" value="${Ident}" />
      <capability name="tagWriter" value="System.Web.UI.HtmlTextWriter" />
    </capabilities>
  </browser>
</browsers>

10:45 pm on Oct 18, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Mar 31, 2002
posts:25430
votes: 0


You may also want to check the MJ12-proprietary Crawler-Ident HTTP header; Selecting that identification method in MJ12's control panel is an inherently more-secure option, as the 'secret' Crawler-Ident value can't normally get 'accidentally' published in standard access log or 'stats' files. It's also a better option for organizations where the server log files and stats are accessible to several or many persons.

Consistent with the example ident in the code above, the format of the new Crawler-Ident header is:

 MJ12bot/v1.3.1; ident:SECRETIDENT 
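
For anyone checking the ident in a script rather than in server config, here is a minimal Python sketch of pulling the value out of either place. It assumes only the two formats quoted in this thread (`ident=...)` at the end of the User-Agent, and `ident:...` in the Crawler-Ident header); the function name `extract_ident` and both regular expressions are illustrative, not anything published by MJ12.

```python
import re

# Assumed formats, taken only from the examples quoted in this thread:
#   User-Agent:    "... MJ12bot/v1.3.1; <url> ident=SECRETIDENT)"
#   Crawler-Ident: "MJ12bot/v1.3.1; ident:SECRETIDENT"
UA_IDENT = re.compile(r"MJ12bot/v[\d.]+;.*?\bident=([^\s;)]+)\)")
HEADER_IDENT = re.compile(r"\bident:\s*([^\s;]+)")


def extract_ident(user_agent, crawler_ident_header=None):
    """Return the ident the crawler sent, or None if it sent none."""
    # Prefer the dedicated header when it is present and parseable.
    if crawler_ident_header:
        m = HEADER_IDENT.search(crawler_ident_header)
        if m:
            return m.group(1)
    # Otherwise fall back to the ident embedded in the User-Agent string.
    m = UA_IDENT.search(user_agent or "")
    return m.group(1) if m else None
```

The returned value would then be compared against the ident you registered; a request that claims to be MJ12bot but yields `None` did not send one by either method.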

Jim

1:13 am on Oct 19, 2009 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:12913
votes: 893


@Jim

Agreed, however an increasing number of the webmasters who will be opting in for MJ12bot's SECRETIDENT feature manage sites on shared hosting (as I do) where header info is not available.

1:43 am on Oct 19, 2009 (gmt 0)

Administrator

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month

joined:Jan 14, 2004
posts:864
votes: 3


Updated to check for the header as well, in case the ident is not included in the User-Agent. I am waiting for actual crawl results from real bots to verify that this works as expected.



<browsers>
  <!--10-18-09 -->
  <!--http://www.webmasterworld.com/search_engine_spiders/3983454-7-10.htm -->
  <browser id="MJ12botWebmasterWorld" parentID="mozilla">
    <sampleHeaders>
      <header name="User-Agent" value="Mozilla/5.0 (compatible; MJ12bot/v1.3.1; [majestic12.co.uk...] ident=SECRETIDENT)" />
    </sampleHeaders>

    <identification>
      <userAgent match="MJ12bot/v(?'version'(?'major'\d+)(?'minor'\.\d+))" />
    </identification>
    <capture>
      <userAgent match=";\sident=(?'Ident'\w*)\)" />
      <header name="Crawler-Ident" match="(?'Ident'\w*)" />
    </capture>
    <capabilities>
      <capability name="crawler" value="true" />
      <capability name="browser" value="MJ12bot" />
      <capability name="majorversion" value="${major}" />
      <capability name="minorversion" value="${minor}" />
      <capability name="version" value="${version}" />
      <capability name="ident" value="${Ident}" />
      <capability name="tagWriter" value="System.Web.UI.HtmlTextWriter" />
    </capabilities>
  </browser>
</browsers>


Example C# code:
// Checks if the bot is MJ12bot.
if (Request.Browser.Browser == "MJ12bot")
{
    // Checks if the ident returned is the one you are expecting.
    if (Request.Browser["ident"] != "yoursecretident")
    {
        // Show the bot the door.
        Response.StatusCode = 403;
        // The status code is sent separately, so the description is the reason phrase only.
        Response.StatusDescription = "Forbidden";
        Response.SuppressContent = true;
        Response.End();
    }
}

9:26 pm on Oct 24, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Aug 8, 2004
posts:1679
votes: 0


I am sorry for the delay guys - the new crawler has now been released, it is version 1.3.1 (don't assume it won't increment though!).

The delay was due to a few issues that needed to be tested.

In one week we'll turn off old crawlers - not long to wait.

If you come across any issues with the new functionality, please don't hesitate to email me directly.

Thank you

Alex

3:25 am on Nov 9, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 5, 2005
posts: 2067
votes: 2


I'm unhappy to report that the first visit I've seen from any 'newer' bot (v1.3.1) neither came from a properly configured machine nor requested robots.txt. Here's the complete ELF access_log entry:

. - - [08/Nov/2009:17:08:54 -0800] "GET / HTTP/1.0" 403 1437 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.3.1; http://www.majestic12.co.uk/bot.php?+)"

The bot was initially 403'd due to the 'dot' Host/IP (although had the Host/IP passed muster, it would've been 403'd because of no robots.txt).

I guess the UA could've been spoofed as it lacks the planned-for "ident=SECRETIDENT" part of the string. Thing is, the vast majority of fauxbot-runners I see use routine, even old, versions of bots. Why bother recoding to spoof something this new this soon?

3:40 am on Nov 9, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 5, 2005
posts: 2067
votes: 2


And FWIW, just about two hours later, 'visitor' #2 -- also faulty:

webn.tracking202.com
Mozilla/5.0 (compatible; MJ12bot/v1.3.1; http://www.majestic12.co.uk/bot.php?+)

robots.txt? NO

(I won't keep posting these unless the hits change for the better, or worse.)

1:50 am on Nov 10, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Mar 31, 2002
posts:25430
votes: 0


I haven't seen a valid v1.3.0 or v1.3.1 MJ12bot yet, except for a "manual test run" done by Lord Majestic himself. So far, they're all v1.2.5, or what I assume to be v1.3.x spoofers which don't send the crawler-ident in either the user-agent or the HTTP header. So I'm assuming that the roll-out of the new version has been delayed or is going much more slowly than predicted -- or perhaps all the MJ12bots that have ever visited my sites have all been spoofers. :(

Jim

10:44 pm on Nov 10, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Aug 8, 2004
posts:1679
votes: 0


Hi,

We've rolled out v1.3.1 and we will be switching off older versions in a week. I am going to go through another test to confirm that 1.3.x definitely works though - bugs are always possible :(

1:54 pm on Nov 17, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Mar 31, 2002
posts:25430
votes: 0


I'm eagerly awaiting my first v1.3.x request with a valid crawler-ident... :)

Jim

2:07 am on Dec 17, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 5, 2005
posts: 2067
votes: 2


Still no news. Still no-robots hits. And now, a no-Host hit, too:

.
Mozilla/5.0 (compatible; MJ12bot/v1.3.1; http://www.majestic12.co.uk/bot.php?+)

robots.txt? NO

(That lone dot is all that appeared in my access_log as the Host.)

3:14 am on Dec 17, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Mar 31, 2002
posts:25430
votes: 0


I'm seeing at least a 50-to-1 ratio of MJ12 spoofers to real requests. The scale of the spoofing problem that faced MJ12 has only become obvious with the release of 1.3.1 and its validation capability, and the number of fake MJ12 requests has surprised (OK, I should say stunned) both me and LordMajestic.

The real MJ12 does fetch and obey robots.txt, but unless you've set your crawler-ident string and are testing for it, you'll never know the real MJ12 from the fakers. And it's now quite clear that there are a *lot* more fakers...
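
To put a number on that ratio for your own site, a rough Python sketch along these lines can tally MJ12bot hits in an access log. It assumes Combined Log Format (User-Agent as the last quoted field) and the User-Agent ident method, since the Crawler-Ident header does not appear in standard access logs; `tally_mj12` and its arguments are made-up names for illustration.

```python
import re
from collections import Counter

# Combined Log Format: the User-Agent is the last double-quoted field.
LAST_QUOTED = re.compile(r'"([^"]*)"\s*$')


def tally_mj12(lines, secret):
    """Count MJ12bot hits whose User-Agent does or does not carry `secret`."""
    counts = Counter()
    for line in lines:
        m = LAST_QUOTED.search(line)
        if not m or "MJ12bot" not in m.group(1):
            continue  # not a request claiming to be MJ12bot
        ua = m.group(1)
        # The thread's example User-Agent ends with "ident=SECRETIDENT)".
        key = "with_ident" if f"ident={secret})" in ua else "without_ident"
        counts[key] += 1
    return counts
```

Feeding it the lines of a log file (for example `tally_mj12(open("access_log"), "yoursecretident")`, with your own path and ident) gives the two counts to compare. Keep in mind the script then contains your secret, so treat it like the log files themselves.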

You might want to have a word with your host about that "lone dot" problem; Your server should fall back to giving you the IP address if there is no rDNS available for a requesting IP and it can't give you the hostname.

Jim

4:49 am on Dec 17, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 5, 2005
posts: 2067
votes: 2


Actually, I prefer to simply block any Host/IP without a dot, and any Host/IP that's only a dot. (In fact, you kindly provided the rewrite code for the former way back when, tyvm:)
12:31 pm on Dec 17, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Aug 8, 2004
posts:1679
votes: 0


This spoofing is very worrying - we don't get many reports of it, in fact I was very surprised it was the case - perhaps they go for specific sites of interest?
10:55 pm on Dec 17, 2009 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:May 14, 2008
posts: 3289
votes: 19


I got the one below today. No idea if it's genuine or not as I haven't signed up. I block all distributed bots, as I've said before.

UA: Mozilla/5.0 (compatible; MJ12bot/v1.3.2; [majestic12.co.uk...]
IP: 84.168.193.nnn (Deutsche Telekom AG)
Headers: Accept only
No referer

As to why spoofing: if you can't determine whether or not the bot is genuine then it's a field day, and I doubt very much if many web site owners know anything about signing up for it - or even about trapping bots. Look in the logs, hey, the bot's got an info page, must be genuine! "Allow it in robots.txt, which I've just about mastered." Bingo.

12:33 am on Dec 18, 2009 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:12913
votes: 893


With the overwhelming number of questionable MJ12bots, IMO many are authentic, but haven't been updated. I don't have any evidence of this, but there are just too many. Out of 50 MJ12bot hits, maybe 2 will be 1.3x and validate w/ the secret ID.
2:53 am on Dec 18, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 24, 2003
posts:729
votes: 0


I love the bot validation key idea. I've registered and am awaiting my email confirmation.

One question I have is what would be the .htaccess entry we could use to block all fake MJ12bot hits while allowing the real hit with the proper key?

3:44 am on Dec 18, 2009 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:12913
votes: 893


@KenB

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} MJ12bot
RewriteCond %{HTTP_USER_AGENT} !YOUR-SECRET-ID
RewriteRule !^robots\.txt$ - [F]

Replace YOUR-SECRET-ID with your own ident, and exempt your custom 403 page from the rule if you use one.

4:26 am on Dec 18, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Mar 31, 2002
posts:25430
votes: 0


MJ12 also supports sending a custom HTTP request header containing the "YOUR-SECRET-ID". This method is somewhat safer since there is absolutely no risk of your secret ID getting published accidentally in log or stats files -- or even being posted in a forum accidentally due to webmaster errors.

It's also much safer if you're in an organization or business arrangement where multiple people have log file access, but not all are supposed to know the value of "YOUR-SECRET-ID".

In order to use the HTTP header method, select that option in the MJ12 setup form, and then use a modified version of the code above:


RewriteEngine On
#
RewriteCond %{HTTP_USER_AGENT} MJ12bot
RewriteCond %{HTTP:Crawler-Ident} !=YOUR-SECRET-ID
RewriteRule !^robots\.txt$ - [F]

Actually, MJ12 will allow you to select both methods. There's some merit in that for initial testing, but generally no advantage to using both. Use the Crawler-Ident method if you can, otherwise use the User-agent-string method.
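
For those who can't use mod_rewrite, the same decision can be made in application code. The following is only a sketch of the logic in the rules above, written in Python; `should_block` and its parameters are invented names, and the path is assumed to be the URL path with its leading slash.

```python
import hmac


def should_block(path, user_agent, crawler_ident, secret):
    """Return True when a request claiming to be MJ12bot should get a 403."""
    if "MJ12bot" not in user_agent:
        return False  # not claiming to be MJ12bot, so the rule does not apply
    if path == "/robots.txt":
        return False  # robots.txt stays fetchable, as in the RewriteRule
    # Exact match on the Crawler-Ident header value, compared in constant
    # time. A missing header is treated as an empty string.
    sent = (crawler_ident or "").encode("utf-8")
    return not hmac.compare_digest(sent, secret.encode("utf-8"))
```

The constant-time comparison is an extra precaution that the rewrite rules do not have; a plain `!=` would mirror them more literally.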

Jim

8:23 pm on Dec 18, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 24, 2003
posts: 729
votes: 0


Thanks!

I've implemented jdMorgan's solution. I like the little extra security it provides. Now if we could just get the other legit bots to implement the same concept to help us stop spoofing bots.

I took a look at the stats provided to us once we log in and they are pretty impressive.

9:51 pm on Dec 18, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 5, 2005
posts: 2067
votes: 2


At this time, I think distributed bots could consider implementing ID schemes (yes, that means you, 80legs [webmasterworld.com]). At least most majors (& some minors) can be confirmed/denied depending on their Host/IP (via rDNS/rIP).

How-to info/links here (old but okay): search robots in disguise
(scroll to) But what about crawlers that aren't so well-behaved?
[bing.com...]
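
The rDNS/rIP confirmation mentioned above usually means a forward-confirmed reverse DNS check: look up the hostname for the requesting IP, check that it belongs to the crawler's domain, then resolve that hostname and make sure it maps back to the same IP. A minimal Python sketch, assuming IPv4 and a caller-supplied list of hostname suffixes (`confirm_crawler` and its parameters are illustrative names):

```python
import socket


def confirm_crawler(ip, allowed_suffixes,
                    reverse=socket.gethostbyaddr,
                    forward=socket.gethostbyname_ex):
    """Forward-confirmed reverse DNS check for a crawler's IPv4 address."""
    try:
        hostname = reverse(ip)[0]  # rDNS: IP -> hostname
    except OSError:
        return False  # no rDNS record, so it cannot be confirmed
    if not hostname.lower().endswith(tuple(allowed_suffixes)):
        return False  # hostname is outside the crawler's domain
    try:
        addresses = forward(hostname)[2]  # forward DNS: hostname -> IPs
    except OSError:
        return False
    return ip in addresses  # must map back to the requesting IP
```

The two resolver arguments default to the standard `socket` calls and are only parameters so the function can be exercised without network access. As noted in the thread, this approach only helps with crawlers that run from their operator's own hosts; it cannot vouch for a distributed bot whose nodes run on arbitrary networks.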

5:57 pm on Jan 2, 2010 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 5, 2005
posts: 2067
votes: 2


Well, here's a server farm keeping up with versions. Can't say as I've even seen that with a bot-runner using a faked UA. (Note: The same single letter is obfuscated in the Host name.)

JANUARY, 2010

te02.te*hentrance.*om
Mozilla/5.0 (compatible; MJ12bot/v1.3.2; http://www.majestic12.co.uk/bot.php?+)
robots.txt? NO

JUNE, 2009 (thread [webmasterworld.com])

te02.te*hentrance.*om
Mozilla/5.0 (compatible; MJ12bot/v1.2.5; http://www.majestic12.co.uk/bot.php?+)
robots.txt? NO

te02.te*hentrance.*om
Mozilla/5.0 (compatible; MJ12bot/v1.2.4; http://www.majestic12.co.uk/bot.php?+)
robots.txt? NO

Ironically, June, 2009, was when this thread announced MJ12bot's 'ground breaking validation capability' -- an ident scheme I've not seen requested. Oh, and about robots.txt --

Since v1.2.1, I've not seen any MJ12bot request robots.txt. Numerous versions, scores of Hosts and hits... None. Zero. Nada. Zip.

6:22 pm on Jan 2, 2010 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 24, 2003
posts:729
votes: 0


Anyone want to confirm whether Pfui's host name belongs to the legitimate MJ12bot (change the asterisk to a 'c')?
This 99-message thread spans 4 pages.
 
