[edited by: encyclo at 11:25 am (utc) on Sep. 7, 2009]
Based on the 'timing' observed so far, I suspect it may be another week before MJ12 participants upgrade their 'nodes' and the newest MJ12 version starts showing up in my logs.
Other than the 'test run' already cited, is anybody else seeing MJ12bot/v1.3.1 (or higher) requests with the crawler-ident in them yet?
Jim
Mozilla/5.0 (compatible; MJ12bot/v1.3.1; [majestic12.co.uk...] ident=SECRETIDENT)
We do seem to get 403s from what I believe is jdMorgan's website. I will wait to investigate that with him before releasing the new crawler; better to test it right before release...
<browsers>
  <!-- id and parentID values are assumed; the post did not include the opening tags -->
  <browser id="MJ12bot" parentID="Default">
    <identification>
      <userAgent match="MJ12bot/v(?'version'(?'major'\d+)(?'minor'\.\d+))" />
    </identification>
    <capture>
      <userAgent match=";\sident=(?'Ident'\w*)\)" />
    </capture>
    <capabilities>
      <capability name="crawler" value="true" />
      <capability name="browser" value="MJ12bot" />
      <capability name="majorversion" value="${major}" />
      <capability name="minorversion" value="${minor}" />
      <capability name="version" value="${version}" />
      <capability name="ident" value="${Ident}" />
      <capability name="tagWriter" value="System.Web.UI.HtmlTextWriter" />
    </capabilities>
  </browser>
</browsers>
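For anyone who wants to check what those two patterns actually pick out of a request, here is a rough Python sketch. It is not part of the .browser file and only mirrors its regexes, rewritten in Python's (?P<name>...) group syntax. The function name mj12_capabilities and the sample UA string are made up for illustration; in particular, the "; ident=SECRETIDENT)" tail is an assumed shape, since the full UA is truncated in the post above.

```python
import re

# Same patterns as the .browser definition; .NET writes named groups as
# (?'name'...), Python writes them as (?P<name>...).
IDENTIFICATION = re.compile(r"MJ12bot/v(?P<version>(?P<major>\d+)(?P<minor>\.\d+))")
CAPTURE = re.compile(r";\sident=(?P<Ident>\w*)\)")


def mj12_capabilities(user_agent):
    """Return the captured values, or None if the UA does not look like MJ12bot."""
    identified = IDENTIFICATION.search(user_agent)
    if identified is None:
        return None
    caps = identified.groupdict()
    captured = CAPTURE.search(user_agent)
    # A UA without the ident part still identifies as MJ12bot, just with no ident.
    caps["ident"] = captured.group("Ident") if captured else None
    return caps


# Hypothetical UA string: the "; ident=SECRETIDENT)" ending is an assumption.
sample = (
    "Mozilla/5.0 (compatible; MJ12bot/v1.3.1; "
    "http://www.majestic12.co.uk/bot.php?+; ident=SECRETIDENT)"
)
print(mj12_capabilities(sample))
```

Worth noticing: for v1.3.1 the 'version' group only covers "1.3", and 'minor' keeps its leading dot (".3"), because the pattern stops after the second number.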
Consistent with the example ident in the code above, the format of the new Crawler-Ident header is:
MJ12bot/v1.3.1; ident:SECRETIDENT
Jim
<browsers>
  <!-- id and parentID values are assumed; the post did not include the opening tags -->
  <browser id="MJ12bot" parentID="Default">
    <identification>
      <userAgent match="MJ12bot/v(?'version'(?'major'\d+)(?'minor'\.\d+))" />
    </identification>
    <capture>
      <userAgent match=";\sident=(?'Ident'\w*)\)" />
      <header name="Crawler-Ident" match="(?'Ident'\w*)" />
    </capture>
    <capabilities>
      <capability name="crawler" value="true" />
      <capability name="browser" value="MJ12bot" />
      <capability name="majorversion" value="${major}" />
      <capability name="minorversion" value="${minor}" />
      <capability name="version" value="${version}" />
      <capability name="ident" value="${Ident}" />
      <capability name="tagWriter" value="System.Web.UI.HtmlTextWriter" />
    </capabilities>
  </browser>
</browsers>
example C# code in
// Check whether the request identifies itself as MJ12bot.
if (Request.Browser.Browser == "MJ12bot")
{
    // Check whether the ident captured by the browser definition is the one you expect.
    if (Request.Browser["ident"] != "yoursecretident")
    {
        // Show the bot the door.
        Response.StatusCode = 403;
        // StatusDescription is only the reason phrase; the code is sent separately.
        Response.StatusDescription = "Forbidden";
        Response.SuppressContent = true;
        Response.End();
    }
}
The delay was due to a few issues that needed to be tested.
In one week we'll turn off old crawlers - not long to wait.
If you come across any issues with the new functionality, please don't hesitate to email me directly.
Thank you
Alex
. - - [08/Nov/2009:17:08:54 -0800] "GET / HTTP/1.0" 403 1437 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.3.1; http://www.majestic12.co.uk/bot.php?+)"
The bot was initially 403'd due to the 'dot' Host/IP (although had the Host/IP passed muster, it would've been 403'd because of no robots.txt).
I guess the UA could've been spoofed as it lacks the planned-for "ident=SECRETIDENT" part of the string. Thing is, the vast majority of fauxbot-runners I see use routine, even old, versions of bots. Why bother recoding to spoof something this new this soon?
Jim
The real MJ12 does fetch and obey robots.txt, but unless you've set your crawler-ident string and are testing for it, you'll never know the real MJ12 from the fakers. And it's now quite clear that there are a *lot* more fakers...
You might want to have a word with your host about that "lone dot" problem; your server should fall back to giving you the IP address if there is no rDNS available for a requesting IP and it can't give you the hostname.
Jim
UA: Mozilla/5.0 (compatible; MJ12bot/v1.3.2; [majestic12.co.uk...]
IP: 84.168.193.nnn (Deutsche Telekom AG)
Headers: Accept only
No referer
As to why spoofing: if you can't determine whether or not the bot is genuine then it's a field day, and I doubt very much if many web site owners know anything about signing up for it - or even about trapping bots. Look in the logs, hey, the bot's got an info page, must be genuine! "Allow it in robots.txt, which I've just about mastered." Bingo.
It's also much safer if you're in an organization or business arrangement where multiple people have log file access, but not all are supposed to know the value of "YOUR-SECRET-ID".
In order to use the HTTP header method, select that option in the MJ12 setup form, and then use a modified version of the code above:
RewriteEngine On
#
RewriteCond %{HTTP_USER_AGENT} MJ12bot
RewriteCond %{HTTP:Crawler-Ident} !=YOUR-SECRET-ID
RewriteRule !^robots\.txt$ - [F]
Jim
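For those not on Apache, the same decision can be sketched in a few lines of Python. This is only an illustration of the logic in the rules above, not anything published by MJ12: the function name should_forbid and the way the path and headers are passed in are assumptions, and the constant-time comparison via hmac.compare_digest is an extra precaution that the mod_rewrite version does not have.

```python
import hmac


def should_forbid(path, headers, secret):
    """Return True if the request claims to be MJ12bot but lacks the right ident."""
    # HTTP header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value for name, value in headers.items()}

    # Mirrors: RewriteCond %{HTTP_USER_AGENT} MJ12bot
    if "MJ12bot" not in lowered.get("user-agent", ""):
        return False

    # Mirrors: RewriteRule !^robots\.txt$ (robots.txt stays reachable for everyone)
    if path == "/robots.txt":
        return False

    # Mirrors: RewriteCond %{HTTP:Crawler-Ident} !=YOUR-SECRET-ID
    presented = lowered.get("crawler-ident", "")
    return not hmac.compare_digest(presented.encode(), secret.encode())


# Example: a request with the MJ12bot UA but no Crawler-Ident header gets refused.
fake = {"User-Agent": "Mozilla/5.0 (compatible; MJ12bot/v1.3.2; http://www.majestic12.co.uk/bot.php?+)"}
print(should_forbid("/", fake, "YOUR-SECRET-ID"))
```

As with the Apache rules, the secret never appears in the User-Agent, so it does not end up in ordinary access logs.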
How-to info/links here (old but okay): search robots in disguise
(scroll to) But what about crawlers that aren’t so well-behaved?
[bing.com...]
JANUARY, 2010
te02.te*hentrance.*om
Mozilla/5.0 (compatible; MJ12bot/v1.3.2; http://www.majestic12.co.uk/bot.php?+)
robots.txt? NO
JUNE, 2009 (thread [webmasterworld.com])
te02.te*hentrance.*om
Mozilla/5.0 (compatible; MJ12bot/v1.2.5; http://www.majestic12.co.uk/bot.php?+)
robots.txt? NO
te02.te*hentrance.*om
Mozilla/5.0 (compatible; MJ12bot/v1.2.4; http://www.majestic12.co.uk/bot.php?+)
robots.txt? NO
Ironically, June, 2009, was when this thread announced MJ12bot's 'ground breaking validation capability' -- an ident scheme I've not seen requested. Oh, and about robots.txt --
Since v1.2.1, I've not seen any MJ12bot request robots.txt. Numerous versions, scores of Hosts and hits... None. Zero. Nada. Zip.