Since v1.2.1, I've not seen any MJ12bot request robots.txt. Numerous versions, scores of Hosts and hits... None. Zero. Nada. Zip.
Any MJ12bot that does NOT request robots.txt prior to crawling a batch of URLs (usually fewer than 300) is a FAKE one. That's an easy sign - if something claims to be MJ12bot but never requests robots.txt, it is fake. Note: in some rare cases robots.txt can be cached for up to 24 hours, but this almost never happens.
Once again, please - NO robots.txt means 100% fake. If you are unsure, our email address is in very large letters right at the top of our bot's page - impossible to miss. Please email us and we will always investigate such reports :)
I find it very difficult to properly support such queries on here, so could everyone with MJ12bot-related queries (fake IP or not, why it did not take robots.txt, etc.) please direct them to the email address that we monitor all the time.
Please supply the IPs of the requests and the URLs that were requested (ideally including the robots.txt requests).
If anyone wants me to run a test on their site (crawl homepage for example) then I'll gladly do so.
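If you want to automate that robots.txt check against your own logs, here is a rough Python sketch - the log path and the Apache combined-log format are assumptions, so adjust for your server. It flags any MJ12bot hit whose IP made no robots.txt request in the preceding 24 hours:

    import re
    from datetime import datetime, timedelta

    LOGFILE = "access.log"        # assumption: adjust to your log location
    WINDOW = timedelta(hours=24)  # robots.txt may be cached for up to 24h

    # Crude Apache combined-log pattern: IP, timestamp, request, user-agent.
    LINE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+)[^"]*"'
                      r' \d+ \S+ "[^"]*" "([^"]*)"')

    robots_fetches = {}  # ip -> timestamps of its robots.txt requests
    for raw in open(LOGFILE):
        m = LINE.match(raw)
        if not m:
            continue
        ip, stamp, method, path, agent = m.groups()
        when = datetime.strptime(stamp.split()[0], "%d/%b/%Y:%H:%M:%S")
        if path == "/robots.txt":
            robots_fetches.setdefault(ip, []).append(when)
        elif "MJ12bot" in agent:
            recent = robots_fetches.get(ip, [])
            if not any(when - WINDOW <= t <= when for t in recent):
                print("possible fake:", ip, when, path)

This assumes the log is in chronological order, as access logs normally are, and keys the check per IP, since each real crawler node fetches robots.txt itself before taking a batch.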
Will your other bots on other IP addresses that aren't blocked pick up the slack?
Most likely.
IP blocking is not effective with distributed crawlers - note that we cannot tell whether we were blocked intentionally by IP or the server is simply not responding - so if you want to disallow our crawling, please use robots.txt.
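For anyone unsure of the syntax, a minimal robots.txt that disallows our crawler entirely looks like this ("MJ12bot" is the user-agent token the bot matches against):

    User-agent: MJ12bot
    Disallow: /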
So in this case, the new MJ12 Crawler Ident method is working exactly as planned, allowing me to reject these requests without any fuss. And the fact that I'm rejecting several of them every day illustrates just how badly-needed this solution was -- both from our perspective as webmasters, and from Lord Majestic's perspective of trying to avoid having MJ12 blacklisted as a 'bad bot' and banned from many sites because of all the imposters.
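As a rough sketch of that kind of rejection rule in .htaccess - assuming your registered ident phrase appears in the User-Agent string, with "MySecretIdent" as a placeholder for whatever phrase you registered:

    # Reject anything claiming to be MJ12bot without my registered ident.
    # "MySecretIdent" is a placeholder -- substitute your own phrase.
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} MJ12bot [NC]
    RewriteCond %{HTTP_USER_AGENT} !MySecretIdent
    RewriteRule .* - [F]

Requests that match the first condition but fail the second get a 403, and the real crawler sails through.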
I sent Lord Majestic a month's worth of logged MJ12 requests, none of which had included Crawler-Idents, and none of which requested robots.txt. He checked the long list of IP addresses, and *none* of them belonged to any member of his volunteer crawler team.
So if you've registered a 'Crawler-Ident phrase' for MJ12 to use when making requests to your site and you get a request without that Crawler-Ident in it, then that's certainly a fake. If you haven't registered with MJ12, then you can tell a fake from a real one only by looking to see if it requested robots.txt any time in the previous 24 hours.
Registering with MJ12 has other benefits. While Lord Majestic has been very good about complying with our TOS here and not promoting his site in these threads, I'd like to say that it *is* worth a look. Go set up your site with an MJ12 Crawler-Ident, and then when you see a purported MJ12 request and it doesn't have your Crawler-Ident, you will know immediately that it's a fake. You will also gain access to some extremely useful data.
I'm amazed that Lord Majestic has time to respond here with all the technical work going on at MJ12 and all the activity resulting from a recent industry award and a new contract... :)
Jim
Actually I don't want to block your bot and have used your secret ID trick above to make sure you can get through. I think your validation method is innovative and it is the primary reason I let your bot in. I'm hoping others will follow your lead and implement the same validation method.
What I do however, is block IP ranges of web hosting server farms as legitimate users/processes almost never come from these sources BUT lots of spam bots and site scrapers do. For instance, I block IP ranges for ThePlanet and Amazon Web Services servers because they are a big source of spam bots.
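In Apache 2.2 terms that is just a handful of Deny lines. The CIDR ranges below are documentation placeholders, not real allocations - look up the current ranges for the providers you actually want to block:

    # Block hosting-farm ranges; legitimate visitors rarely come from these.
    # 192.0.2.0/24 and 198.51.100.0/24 are placeholders -- substitute the
    # provider's actual allocations (ThePlanet, AWS, etc.).
    Order Allow,Deny
    Allow from all
    Deny from 192.0.2.0/24
    Deny from 198.51.100.0/24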
"I sent Lord Majestic a month's worth of logged MJ12 requests, none of which had included Crawler-Idents, and none of which requested robots.txt. He checked the long list of IP addresses, and *none* of them belonged to any member of his volunteer crawler team."
I have to say I was very surprised to see that, so I double- and triple-checked. The lack of robots.txt requests was an additional check - they were all fake. Shocking, to be honest :(
I hope our actions in implementing this solution have shown good will - sadly, by itself it does not remove the fake bots that pretend to be us, and I wish I knew how to do that (feasibly) :(
The big difficulty with those fake bots is that many seem to run on compromised PCs (botnets), and there is nothing we can do about that - the people responsible for such things are often in countries where the laws aren't exactly well respected. :(