Since v1.2.1, I've not seen any MJ12bot request robots.txt. Numerous versions, scores of Hosts and hits... None. Zero. Nada. Zip.
Any MJ12bot that does NOT request robots.txt prior to crawling a batch of URLs (usually fewer than 300) is a FAKE one. That's an easy sign - if something claims to be MJ12bot but never requests robots.txt, it is fake. Note: in some rare cases robots.txt can be cached for up to 24 hours, but this almost never happens.
Once again, please - NO robots.txt means 100% fake. If you are unsure, our email address is in very large letters right at the top of our bot's page - impossible to miss. Please email us and we will always investigate such reports :)
I find it very difficult to properly support such queries on here, so could everyone with MJ12bot-related queries (fake IP or not, why it did not take robots.txt, etc.) please direct them to the email address that we monitor all the time.
Please supply the IPs of the requests and the URLs that were requested (ideally including the robots.txt requests).
If anyone wants me to run a test on their site (crawl homepage for example) then I'll gladly do so.
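If you want to automate that robots.txt check against your own logs, here is a rough Python sketch - the log path and the Apache combined-log format are assumptions, so adjust for your server. It flags any MJ12bot hit whose IP made no robots.txt request in the preceding 24 hours:

    import re
    from datetime import datetime, timedelta

    LOGFILE = "access.log"        # assumption: adjust to your log location
    WINDOW = timedelta(hours=24)  # robots.txt may be cached for up to 24h

    # Crude Apache combined-log pattern: IP, timestamp, request, user-agent.
    LINE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+)[^"]*"'
                      r' \d+ \S+ "[^"]*" "([^"]*)"')

    robots_fetches = {}  # ip -> timestamps of its robots.txt requests
    for raw in open(LOGFILE):
        m = LINE.match(raw)
        if not m:
            continue
        ip, stamp, method, path, agent = m.groups()
        when = datetime.strptime(stamp.split()[0], "%d/%b/%Y:%H:%M:%S")
        if path == "/robots.txt":
            robots_fetches.setdefault(ip, []).append(when)
        elif "MJ12bot" in agent:
            recent = robots_fetches.get(ip, [])
            if not any(when - WINDOW <= t <= when for t in recent):
                print("possible fake:", ip, when, path)

This assumes the log is in chronological order, as access logs normally are, and keys the check per IP, since each real crawler node fetches robots.txt itself before taking a batch.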
Will your other bots on other IP addresses that aren't blocked pick up the slack?
Most likely.
IP blocking is not effective with distributed crawlers - note that we cannot tell whether we were blocked intentionally by IP or the server is simply not responding - so if you want to disallow our crawling, please use robots.txt.
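For anyone unsure of the syntax, a minimal robots.txt that disallows our crawler entirely looks like this ("MJ12bot" is the user-agent token the bot matches against):

    User-agent: MJ12bot
    Disallow: /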
So in this case, the new MJ12 Crawler Ident method is working exactly as planned, allowing me to reject these requests without any fuss. And the fact that I'm rejecting several of them every day illustrates just how badly-needed this solution was -- both from our perspective as webmasters, and from Lord Majestic's perspective of trying to avoid having MJ12 blacklisted as a 'bad bot' and banned from many sites because of all the imposters.
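As a rough sketch of that kind of rejection rule in .htaccess - assuming your registered ident phrase appears in the User-Agent string, with "MySecretIdent" as a placeholder for whatever phrase you registered:

    # Reject anything claiming to be MJ12bot without my registered ident.
    # "MySecretIdent" is a placeholder -- substitute your own phrase.
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} MJ12bot [NC]
    RewriteCond %{HTTP_USER_AGENT} !MySecretIdent
    RewriteRule .* - [F]

Requests that match the first condition but fail the second get a 403, and the real crawler sails through.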
I sent Lord Majestic a month's worth of logged MJ12 requests, none of which had included Crawler-Idents, and none of which requested robots.txt. He checked the long list of IP addresses, and *none* of them belonged to any member of his volunteer crawler team.
So if you've registered a 'Crawler-Ident phrase' for MJ12 to use when making requests to your site and you get a request without that Crawler-Ident in it, then that's certainly a fake. If you haven't registered with MJ12, then you can tell a fake from a real one only by looking to see if it requested robots.txt any time in the previous 24 hours.
Registering with MJ12 has other benefits. While Lord Majestic has been very good about complying with our TOS here and not promoting his site in these threads, I'd like to say that it *is* worth a look. Go set up your site with an MJ12 Crawler-Ident, and then when you see a purported MJ12 request and it doesn't have your Crawler-Ident, you will know immediately that it's a fake. You will also gain access to some extremely useful data.
I'm amazed that Lord Majestic has time to respond here with all the technical work going on at MJ12 and all the activity resulting from a recent industry award and a new contract... :)
Jim
Actually I don't want to block your bot and have used your secret ID trick above to make sure you can get through. I think your validation method is innovative and it is the primary reason I let your bot in. I'm hoping others will follow your lead and implement the same validation method.
What I do however, is block IP ranges of web hosting server farms as legitimate users/processes almost never come from these sources BUT lots of spam bots and site scrapers do. For instance, I block IP ranges for ThePlanet and Amazon Web Services servers because they are a big source of spam bots.
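In Apache 2.2 terms that is just a handful of Deny lines. The CIDR ranges below are documentation placeholders, not real allocations - look up the current ranges for the providers you actually want to block:

    # Block hosting-farm ranges; legitimate visitors rarely come from these.
    # 192.0.2.0/24 and 198.51.100.0/24 are placeholders -- substitute the
    # provider's actual allocations (ThePlanet, AWS, etc.).
    Order Allow,Deny
    Allow from all
    Deny from 192.0.2.0/24
    Deny from 198.51.100.0/24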
"I sent Lord Majestic a month's worth of logged MJ12 requests, none of which had included Crawler-Idents, and none of which requested robots.txt. He checked the long list of IP addresses, and *none* of them belonged to any member of his volunteer crawler team."
I have to say I was very surprised to see that, so I double- and triple-checked. The lack of robots.txt requests was an additional check - they were all fake. Shocking, to be honest :(
I hope our actions in implementing this solution have shown good will - sadly, by itself it does not remove the fake bots that pretend to be us, and I wish I knew how to do that (feasibly) :(
The big difficulty with those fake bots is that many seem to run on compromised PCs (botnets), and there is nothing we can do about that - the people responsible for such things are often in countries where the laws aren't exactly well respected. :(