UA: Mozilla/5.0 (compatible; MJ12bot/v1.2.5; [majestic12.co.uk...]
06-12: From: Server Farm
robots.txt? NO
UA: Mozilla/5.0 (compatible; MJ12bot/v1.2.4; [majestic12.co.uk...]
05-18: From: Server Farm
robots.txt? NO
05-07: From: . (that's a dot; misconfigured server)
robots.txt? NO
05-07: From: Server Farm
robots.txt? NO
05-02: From: 205.209.158.num
robots.txt? YES
Fake referer? YES
04-11: From: . (misconfigured server)
robots.txt? NO
Seems to me the long-standing, oft-proffered claim that the bot obeys robots.txt is specious/misleading when MJ12bot doesn't ask for it in the first place.
[edited by: incrediBILL at 3:22 am (utc) on June 21, 2009]
[edit reason] Removed URL [/edit]
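For anyone who wants to run the same "robots.txt? YES/NO" tally against their own logs, here is a minimal sketch. It assumes you have already parsed your access log into (client, path, user-agent) tuples; the function name, the sample clients, and the sample entries are all made up for illustration, and the UA strings are left truncated the same way they are in the listing above.

```python
from collections import defaultdict


def robots_txt_report(entries):
    """Map each client whose UA mentions MJ12bot to True/False:
    did that client request /robots.txt at least once?

    `entries` is an iterable of (client, path, user_agent) tuples,
    a hypothetical pre-parsed form of an access log.
    """
    asked = defaultdict(bool)
    for client, path, user_agent in entries:
        if "MJ12bot" not in user_agent:
            continue
        # Ignore any query string when comparing the requested path.
        is_robots = path.split("?", 1)[0] == "/robots.txt"
        # Touching the key here means clients that never asked
        # still appear in the report, with a value of False.
        asked[client] |= is_robots
    return dict(asked)


# Made-up sample entries; client labels are placeholders only.
sample = [
    ("client-a", "/robots.txt", "Mozilla/5.0 (compatible; MJ12bot/v1.2.4; ...)"),
    ("client-a", "/index.html", "Mozilla/5.0 (compatible; MJ12bot/v1.2.4; ...)"),
    ("client-b", "/index.html", "Mozilla/5.0 (compatible; MJ12bot/v1.2.5; ...)"),
    ("client-c", "/robots.txt", "SomeOtherBot/1.0"),
]

report = robots_txt_report(sample)
for client, did_ask in sorted(report.items()):
    print(f"From: {client}  robots.txt? {'YES' if did_ask else 'NO'}")
```

Bear in mind this only shows whether a given client fetched robots.txt within the log window you feed it; a bot that cached the file earlier, or fetched it from a different address, would show up as NO.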
I am going to send Bill info on the proposed scheme; it can be discussed here as well (if anyone is interested). Once you are happy that the proposed way is acceptable, I'll get it implemented. I think there is a good chance that we'll have this working (in test phase) by mid-August.
SITE #1:
UA: Mozilla/5.0 (compatible; MJ12bot/v1.2.4; [majestic12.co.uk...]
Host: [yada-yada].craw.blueyonder.co.uk
Referers:
[majestic12.co.uk...]
[majestic12.co.uk...]
SITE #2:
UA: Mozilla/5.0 (compatible; MJ12bot/v1.2.5; [majestic12.co.uk...]
IP: 205.209.170.nn
Referer:
[majestic12.co.uk...]
QUESTIONS:
If those two different 27-character strings (10 nos hyphen 16 nos) were/are each site's 'key' -- how'd two, erm, somebody elses get them? Or were those just more fake bots and fake referers? Or--?
TIA
[edited by: incrediBILL at 8:15 am (utc) on Sep. 7, 2009]
[edit reason] clean up [/edit]
If those two different 27-character strings (10 nos hyphen 16 nos) were/are each site's 'key'
These numbers are not site keys. They are debug params that we only send when getting robots.txt; they allow us to identify the "bucket" those URLs were in, as well as the exact crawler that crawled it. A bucket can contain up to 10k URLs from all sorts of sites, and the same domain can be present in multiple "buckets", hence these numbers can't be used for crawler/site identification, even though I've never seen fakes show such a referer (they usually don't bother getting robots.txt in the first place).
The new "ident" feature is user-configurable, and it will be sent consistently by the new version of the bot: v1.3.0 (or higher).
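For log-watchers, a rough sketch of two checks that follow from the above: pulling the bot version out of the UA to see whether it is v1.3.0 or higher, and spotting the debug param in a referer. The helper names are made up, the "10 digits, hyphen, 16 digits" pattern is taken only from the description earlier in this thread, and the example referer URL is a placeholder, not the real one. A UA string can be faked, so none of this authenticates the bot.

```python
import re

# Version as it appears in the UA strings quoted in this thread,
# e.g. "MJ12bot/v1.2.5".
MJ12_VERSION = re.compile(r"MJ12bot/v(\d+)\.(\d+)\.(\d+)")

# Assumed layout of the debug param: 10 digits, a hyphen, 16 digits,
# not embedded in a longer run of digits.
DEBUG_PARAM = re.compile(r"(?<!\d)\d{10}-\d{16}(?!\d)")


def mj12_version(user_agent):
    """Return the version as a tuple of ints, or None if not found."""
    match = MJ12_VERSION.search(user_agent)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def claims_ident_capable(user_agent):
    """True if the UA claims v1.3.0 or higher (a claim, not proof)."""
    version = mj12_version(user_agent)
    return version is not None and version >= (1, 3, 0)


def find_debug_param(referer):
    """Return the 27-character debug param if present, else None."""
    match = DEBUG_PARAM.search(referer)
    return match.group(0) if match else None


old_ua = "Mozilla/5.0 (compatible; MJ12bot/v1.2.5; ...)"
new_ua = "Mozilla/5.0 (compatible; MJ12bot/v1.3.0; ...)"
# Placeholder referer; the real URL shape is not shown in this thread.
referer = "http://example.invalid/?1234567890-1234567890123456"

print(mj12_version(old_ua), claims_ident_capable(old_ua))
print(mj12_version(new_ua), claims_ident_capable(new_ua))
print(find_debug_param(referer))
```

Comparing versions as integer tuples avoids the string-comparison trap where "1.10.0" would sort before "1.3.0". Since the same domain can sit in several buckets, a matched debug param says nothing about which site or crawler it belongs to.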