| 2:29 pm on Jun 21, 2009 (gmt 0)|
|Seems to me the long-standing, oft' proffered claim that the bot obeys robots.txt is specious/misleading when the MJ12bot doesn't ask for it in the first place. |
Every time I've added the isBanned property to these bots I get a sticky or an e-mail claiming it does too obey robots.txt, and then one of the bots comes calling and sure enough it reads and respects robots.txt. So I'm left to wonder: if it can obey robots.txt, why doesn't it do so all the time?
| 3:25 pm on Jun 21, 2009 (gmt 0)|
As much of a thorn as this gent has been to me!
He was a longtime (and may still be a) participant in this forum.
With the above in mind, and out of courtesy for his previous contributions, I seem to recall him explaining how to determine the difference (in the UA) between somebody using a rogue version of his bot and the current version.
Unfortunately I failed to mark the comment.
Suggest searching the archives for his User profile.
| 5:42 pm on Jun 21, 2009 (gmt 0)|
@GaryK: I'm familiar with at least one self-installed engine, Thunderstone's Webinator, that lets the installer override its default config [thunderstone.com] to request/heed robots.txt. And I recall there are numerous offline downloaders/plugins that also allow something akin to 'ignore robots.txt' (via commenting out a line, etc.). I do not know about the MJ12bot distributions.
@wilderness: Apparently there was a rogue version, v1.08, dating back to 2007/8. And there may be others [majestic12.co.uk], just as we all see spoofed Googlebots and the like. That said, I reckon my OP's "robots.txt? NO" samples from v1.2.4 and v1.2.5 were not all rogue bots.
Point is, I'm simply reporting another bot that hits many of my sites and the vast majority of the time does not request robots.txt.
| 5:55 pm on Jun 21, 2009 (gmt 0)|
|Point is, I'm simply reporting another bot that hits many of my sites and the vast majority of the time does not request robots.txt. |
In spite of my stated objectivity and courtesy?
I've had this bot denied from my sites since its inception.
| 6:41 pm on Jun 21, 2009 (gmt 0)|
| 6:53 pm on Jun 21, 2009 (gmt 0)|
OK, I was in a huge rush earlier, but now I do seem to recall his explanation about how to distinguish bots under his control from the distributed ones.
Mozilla/5.0 (compatible; MJ12bot/v1.2.5; [majestic12.co.uk...]
Distributed version, or possibly spoofed:
I still think it would be nice if his bots all used a common IP address or had a common base rDNS.
| 2:19 am on Jun 23, 2009 (gmt 0)|
MJ12bot is ours and it obeys robots.txt just fine - in fact we've added another layer of robots.txt filtering (it happens during url load, to cut off such urls early for domains with more than a few thousand urls on them) and robots.txt is also checked by the crawler just before a batch of 300-400 urls is crawled.
The problem is that there are fake bots around that claim to be MJ12bot and other user-agents. We are NOT responsible for them - there is nothing we can do about them, it's the same as spammers who use fake From: address.
We have the bot's email address on our bot's page (in HUGE FONT too, right on top of the page so nobody can miss it) - it's monitored all the time. We get about one or two emails every month; sometimes it's a fake bot, sometimes a wrong robots.txt.
Pfui, if you could please email me those log requests with IP addresses then I'll do a quick check and post here whether they were fake.
Let me stress - our bot DOES support robots.txt and has since the start (we had a few bugs in the early years but now support is solid). If you have evidence to the contrary PLEASE contact us first before making such allegations.
How do you know it's a genuine bot? First, it will request (and obey) robots.txt. Second, it will have a referer set for that robots.txt request that allows us to trace which "url bucket" was crawled by whom.
|I still think it would be nice if his bots all used a common IP address or had a common base rDNS. |
We simply can't have it - on our website we explain what we do and how. We use distributed crawlers run by volunteers all around the web; it's a community project - we just can't have a single IP address or subnet. :(
| 2:28 am on Jun 23, 2009 (gmt 0)|
|And I recall there are numerous offline downloaders/plugins that also allow something akin to 'ignore robots.txt' (via commenting out a line, etc.). I do not know about the MJ12bot distributions. |
Our distribution is binary and it does not allow disabling of robots.txt - it can't be run on its own, because data is fed from the central server.
Here are internal stats for this month (Jun 2009) up until now:
URLs crawled: 6,474,464,286 (that's 6.4 bln)
URLs not crawled due to robots.txt limits: 92,233,930 (1.42% of total) - if we ignored robots.txt we would not have these at all!
Additionally (as explained above) we apply the robots.txt exclusions earlier, at url load, before the actual crawl.
This month we had 5 people who emailed us on our bot's email address (I exclude spam here). That's about 1 email per 1 bln urls crawled (I think it is a pretty good ratio).
The breakdown of emails is as follows:
1 was a report about a fake bot that claimed to be MJ12bot
2 were incorrect reports about a supposedly fake bot (it was ours and we could see it took robots.txt just fine)
1 was a report about our bot violating robots.txt - this was due to an error in the robots.txt, and the webmaster accepted that.
1 was a request to stop crawling one website (which we did, using an additional layer on top of our robots.txt compliance).
Total: 5 (five).
Do you think someone like us who crawls now nearly 10 bln urls every month won't get slaughtered everywhere if we totally ignored robots.txt?
I am prepared to swear an affidavit (if you are paying for it) that the above is true to the best of my knowledge :)
It's a lot easier and smarter for us to follow robots.txt and avoid lots of aggravation from people like you - we have tried very hard to be a good netizen, but we can't do anything about faked user-agents. I wish I could - sadly those guys often use botnets, so we can't even sue them - and we get all the flak :(
[edited by: Lord_Majestic at 2:49 am (utc) on June 23, 2009]
| 5:43 pm on Jun 23, 2009 (gmt 0)|
Fair is fair, and so I'd like to point out that there are several fake Googlebots running around these days, yet there isn't much Google-bashing going on here for some reason.
I see the MJ12bot requests with the "crawl bucket" info, such as
|138.47.102.xx - - [17/Jun/2009:01:59:10 -0600] "GET /robots.txt HTTP/1.0" 200 3456 "http://majestic12.co.uk/bot.php?I=AB94713199-DA4EFA39FE4156D2-www_example_com" "Mozilla/5.0 (compatible; MJ12bot/v1.2.4; http://www.majestic12.co.uk/bot.php?+)" |
fetching robots.txt and complying with it. But there are plenty of requests without that crawl-bucket info that don't fetch robots.txt and so don't comply with it. So I think that testing for that crawl bucket in the referrer, plus the "Accept" info as documented on your robots info page, may suffice (for now) to discern fakes from real.
Thanks for posting. If there is any further info you can provide to discern real from fake MJ12bots, such as additional HTTP header info (such as Accept-Charset, Accept-Encoding, Accept-Language, Connection, From) etc., please put it on your 'bot info page.
I've sent a stickymail with the info I have logged for the fake requests.
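The referer test described above can be automated against an access log; here is a minimal sketch, assuming the Apache combined log format and taking the bot.php referer prefix from the sample line quoted earlier (field layout and prefix are assumptions, not MJ12 documentation):

```python
import re

# Apache combined-log pattern: client IP, request line, status, size,
# referer, user-agent (adjust if your log format differs).
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<request>[^"]*)" '
    r'\d{3} \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def looks_genuine_mj12(line):
    """True only if an MJ12bot request carries the bot.php crawl-bucket referer."""
    m = LOG_RE.match(line)
    if not m or "MJ12bot" not in m.group("agent"):
        return False
    # Genuine crawls announce their "url bucket" via a bot.php referer.
    return m.group("referer").startswith("http://majestic12.co.uk/bot.php?")

sample = ('138.47.102.10 - - [17/Jun/2009:01:59:10 -0600] '
          '"GET /robots.txt HTTP/1.0" 200 3456 '
          '"http://majestic12.co.uk/bot.php?I=AB94713199-DA4EFA39FE4156D2-www_example_com" '
          '"Mozilla/5.0 (compatible; MJ12bot/v1.2.4; http://www.majestic12.co.uk/bot.php?+)"')
print(looks_genuine_mj12(sample))  # → True
```

Feed each log line through `looks_genuine_mj12`; MJ12bot entries that come back False are the ones worth a closer look.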
| 5:53 pm on Jun 23, 2009 (gmt 0)|
I just checked all those IPs that you PMed me and they all appear fake :(
Our bot will go away if it gets 403 Forbidden when trying to get robots.txt - as the standard recommends, in this case it assumes that all urls are disallowed for crawling.
|yet there isn't much Google-bashing going on here for some reason. |
Well, that does not surprise me - it's one rule for the big guys and completely another for everyone else. :(
Think about it logically - if we were faking our user-agent, why on earth would we maintain our site with an email address etc etc? It would seem more logical to fake a browser's user-agent in that case.
I think maybe we will start setting a new HTTP header to indicate we are the legit bot; if anyone steals that then we'd know there is a deliberate smear campaign against us and we will try to sue those people.
Another solution could be for us to provide a quick HTTP interface that would give an OK/FAKE response for a queried IP - unfortunately we can't publish the complete list of IPs that we have because it is growing all the time (as well as for other reasons).
| 6:04 pm on Jun 23, 2009 (gmt 0)|
Where is the data that you collect used?
| 6:16 pm on Jun 23, 2009 (gmt 0)|
It's on our site - please follow the bot's link.
Let's not go off-topic however - our legit bot does not ignore robots.txt; it appears that the published cases where that happens are due to bots faking our user-agent.
| 11:00 pm on Jun 23, 2009 (gmt 0)|
|Fair is fair, and so I'd like to point out that there are several fake Googlebots running around these days, yet there isn't much Google-bashing going on here for some reason. |
Jim, unless I'm mistaken anyone who falls for a fake Googlebot isn't taking the time to do full round-trip DNS lookup. That's why I asked if it were possible for MJ12Bot to use some kind of common root rDNS. If it were possible there would be no basis for questioning his bot much like there's no basis for questioning Googlebot. I asked and got my answer.
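The full round-trip lookup GaryK describes can be sketched as follows (the googlebot.com/google.com suffixes follow Google's documented pattern; other engines use different parent domains, and a failed or missing PTR record is treated as a fake):

```python
import socket

def verify_search_bot(ip, domains=(".googlebot.com", ".google.com")):
    """Full round-trip DNS check: IP -> PTR hostname -> A record -> same IP."""
    try:
        host = socket.gethostbyaddr(ip)[0]              # reverse lookup
        if not host.endswith(domains):
            return False                                # wrong parent domain
        return ip in socket.gethostbyname_ex(host)[2]   # forward must confirm
    except OSError:
        return False                                    # no PTR or no A record
```

Anything claiming to be Googlebot from an IP that fails this check can be banned with confidence; as noted above, a distributed crawler has no equivalent, since its IPs share no common parent domain.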
| 11:07 pm on Jun 23, 2009 (gmt 0)|
AFAIK Google itself fakes its user-agent when it sends bots to verify whether a site is cloaking or not. Microsoft search does the same I believe, and I am pretty sure Yahoo does it too. I am not sure it would be possible to rDNS them in this case - has anyone tried?
In any case, while I appreciate that you have an expectation that search engines will come from fixed IPs or subnets, this is not the case for us - there is nothing wrong with that, it's just not something you expect.
It would have been nice if there were a standard plugin/module deployed in all webservers that would auto-ban bots that either did not read robots.txt or disobeyed it.
| 5:03 am on Jun 24, 2009 (gmt 0)|
|It would have been nice if there was a standard plugin/module deployed in all webservers that would auto ban bots that either did not read robots.txt or disobey it. |
It's a good idea in theory. In practice it's not doable, because with automation it's not always possible to ensure that what you're banning is really a bot.
| 10:52 am on Jun 24, 2009 (gmt 0)|
|In practice it's not doable. |
It's doable if the user-agent is set to one of the known bots (like ours) but does not actually obey robots.txt - it would be much harder to apply the same logic to faked browser user-agents though.
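That narrower idea - auto-flagging requests whose user-agent claims a known bot but which never fetched robots.txt - could be sketched roughly like this (a toy in-memory model, not a server module; keying on bare IP is simplistic and the bot-token list is illustrative):

```python
# Known crawler tokens to watch for in user-agent strings (illustrative list).
KNOWN_BOT_TOKENS = ("MJ12bot", "Googlebot", "Slurp", "msnbot")

class RobotsCompliance:
    """Flag page requests from claimed bots that never fetched robots.txt."""

    def __init__(self):
        self.fetched_robots = set()  # IPs that have requested /robots.txt

    def observe(self, ip, path, user_agent):
        """Return True if this request should be banned."""
        if path == "/robots.txt":
            self.fetched_robots.add(ip)   # remember compliant IPs
            return False
        claims_bot = any(tok in user_agent for tok in KNOWN_BOT_TOKENS)
        # A claimed bot fetching pages without ever reading robots.txt is suspect;
        # browser user-agents are left alone, per the caveat above.
        return claims_bot and ip not in self.fetched_robots
```

This only catches fakes borrowing a known bot name, exactly as the post says: a fake hiding behind a browser user-agent sails straight through.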
| 8:12 pm on Jun 24, 2009 (gmt 0)|
1.) Do all legit MJ12bot requests always include referers with the following format?
2.) I get more hits from apparently fake MJ12bots than actually fake Googlebots. I don't report fake Googlebots because when it comes to "Search Engine Spider Identification," they're undeniably, and easily deniable, fakes.
3.) When 'big guys' Google, MSN, Yahoo run any new, cloaked, iffy, or ill-behaved bot, there's always detailed reportage here about the bot(s), the use/misuse/abuse of collected data, how to check for and correct bad or improperly retrieved data, etc.
[edited by: Pfui at 8:18 pm (utc) on June 24, 2009]
| 8:36 pm on Jun 24, 2009 (gmt 0)|
|1.) Do all legit MJ12bot requests always include referers with the following format? |
All current versions do that when requesting robots.txt only - the referer is not set for any other urls (unless it is a redirect): we tried that a couple of years ago and webmasters were not too happy (it's generally impossible to please everyone, but in this case the number of displeased people was too high).
We are going to think of a new HTTP header to be sent with every request that would not pollute normal logs but would allow a quick check to see if it's a fake bot or not.
We are not happy about these fake bots, unfortunately there is not a lot we can do about it: if we had solid proof that a UK based company (other than ours!) is engaged in such activity then we'd most certainly take legal action against them.
A word of warning here though - don't jump to conclusions on whether some MJ12bot is fake or not. We get a couple of emails a month from people thinking something is fake when it's not, only because they look up the IPs and see a University or something like that.
| 1:53 am on Jun 25, 2009 (gmt 0)|
The problem with all distributed bots is that there is NO way of telling if they are genuine. You say it's genuine if it hits robots.txt: that can be faked, witness the myriad reports on awstats bots.
In any case, relying on robots.txt to determine whether a bot is genuine or not means that web site security software has to be able to trap and parse accesses to that file and pass through the IP for an indeterminate period of time. How? This is not a particularly feasible solution on many sites, even if the site can parse hits on robots.txt in the first place, which a lot can't. It is far easier to check the UA on the first page access and kill it with a 403. And, in my case, kill the IP as well.
A few years ago mj12bot sent me demented, hitting at high speed in long swathes from many IPs. Whether it was genuine or not I had no idea and didn't care. It was blocked then and remains blocked now, as are all distributed bots.
Incidentally, at the time of writing your site does not respond and according to robtex there is no A record for it.
| 2:37 am on Jun 25, 2009 (gmt 0)|
At the time of this post, the MJ12 site URL in the user-agent strings posted above resolves fine here (Texas).
| 1:21 pm on Jun 25, 2009 (gmt 0)|
|The problem is that there are fake bots around that claim to be MJ12bot and other user-agents. We are NOT responsible for them - there is nothing we can do about them |
This whole discussion has happened over and over and it's quite silly after all this time.
Claiming to be not responsible for all the fake bots crawling our sites doesn't fly, as we have no means to validate their authenticity. If they too ask for robots.txt and honor it, we have no clue whether it's real or a really good fake.
All MJ12BOT needs to do is issue a site verification code to site owners that care about this problem and then stick that site verification code in the referrer for each subsequent access by MJ12BOT to our site.
We could then block fakes in our .htaccess files by simply checking if it's MJ12BOT and the verification code matches.
Nobody would be able to fake this because that code would only be known by MJ12BOT and the site owner himself, the fakers would be locked out that easily.
You already allow people to validate ownership of their sites for your MJ12 SEO site so this obviously isn't much of a stretch to tie this all together quickly.
As a matter of fact, you could even supply the couple of lines of Apache code people would need to verify MJ12BOT when they get their site validated to make sure the code works properly.
What say you Lord Majestic, fix this once and forever with a simple validation key so that we who care about what crawls or not can put the fake MJ12BOT nonsense behind us for good?
|Word of warning here though - don't jump to conclusion on whether some MJ12bot is fake or not |
I don't; until they can be validated I simply treat them all equally.
Let us help you crawl, give us validation.
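Server-side, the validation check proposed above could look something like this sketch (the header value, key, and function names are all hypothetical - nothing here is an actual MJ12bot feature, which at the time of this thread did not yet exist):

```python
import hmac

# Hypothetical per-site key issued by the bot operator during site validation.
SITE_SECRET = "s3cret-issued-during-site-validation"

def is_validated_bot(user_agent, key_header):
    """Accept an MJ12bot request only when it presents the agreed secret key."""
    if "MJ12bot" not in user_agent:
        return False
    # Constant-time comparison avoids leaking the key byte by byte.
    return key_header is not None and hmac.compare_digest(key_header, SITE_SECRET)
```

A fake bot never sees the key, so it fails the check even with a perfect copy of the user-agent string; the same test could be expressed as a pair of RewriteCond lines in .htaccess once the header name is fixed.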
[edited by: incrediBILL at 1:56 pm (utc) on June 25, 2009]
| 2:04 pm on Jun 25, 2009 (gmt 0)|
|Fair is fair, and so I'd like to point out that there are several fake Googlebots running around these days, yet there isn't much Google-bashing going on here for some reason. |
That's because back in 2006 when Dan Thies and I were on a panel about 'bots and we both went on a rant at SES San Jose, Google's home turf, they stepped up and gave us a method to validate their bots.
By the time PubCon Vegas '06 rolled around several SEs had full round-trip DNS verification and shortly most of the big players followed suit.
That's why you don't hear any complaints about fake Googlebots, Slurp, MSNbot and many others because they stepped up to the plate, gave us the tools we asked for, and the complaining stopped because the problem ceased to exist.
It's a simple situation: those big companies wanted webmasters to feel safe allowing their crawlers to access our sites, knowing with confidence we weren't feeding some vermin that would use our pages for all sorts of unscrupulous things.
Now that we have complete confidence in the solution - no fakes allowed - the problem is solved.
Wouldn't you think anyone else that is serious about crawling the web would also want to solve this problem as well, regardless of whether the crawler was distributed or not, just to instill confidence in their crawler?
The method I described for MJ12BOT validation in the previous post could even be used by that cesspool of crawlers all running under the AWS shared IPs as well.
[edited by: incrediBILL at 2:12 pm (utc) on June 25, 2009]
| 3:01 pm on Jun 25, 2009 (gmt 0)|
|In any case relying on robots.txt to determine if a bot is genuine |
If a bot obeys your robots.txt then in my view it is a good, legit bot in its own right.
|Incidentally, at the time of writing your site does not respond and according to robtex there is no A record for it. |
We had 20 minutes downtime this night from 2:24 to 2:44 am GMT.
|Whether it was genuine or not I had no idea and didn't care. It was blocked then and remains blocked now, as are all distributed bots. |
We respect your block (I hope you allow us to retrieve robots.txt so we can see it). What we say is that we can't be held responsible for the actions of other people who have nothing to do with us and just use faked user-agents: I get email spam every day with fake From: addresses, but I won't take it out on the users whose addresses were faked.
| 3:08 pm on Jun 25, 2009 (gmt 0)|
|Claiming to be not responsible for all the fake bots crawling our sites doesn't fly |
I am not trying to make it fly, I am just making a statement that (obviously) we are not responsible for fakes - Pfui could have validated his allegations very quickly by emailing us at the bot's address provided (it is given right at the top of the page in HUGE letters so that nobody misses it).
|What say you Lord Majestic, fix this once and forever with a simple validation key so that we who care about what crawls or not can put the fake MJ12BOT nonsense behind us for good? |
Ok, I think we could provide such a capability (it won't scale to millions of sites initially, but I assume only a handful of people will use it anyway) - allow site owners on our bot's page to set a secret keycode that will be used by our crawler when accessing their site - this can be sent as an HTTP header.
If I get this implemented, would you be satisfied?
| 3:39 pm on Jun 25, 2009 (gmt 0)|
Putting it in the HTTP header works, but I think it should also be traceable and duplicated either in your user agent field or in the referer field.
Considering that HTTP headers aren't trapped by most webmasters, having the validation field appear in the actual log file will assist those of us who do postmortem analysis of bad bot behavior.
Even Googlebot, Slurp, etc. can be analyzed postmortem with a simple rDNS check of the IP stored in the log file.
If MJ12BOT leaves similar breadcrumbs it could save us from making some wrong assumptions and easily end these types of threads once and for all.
As a matter of fact, I would suggest that whatever deal MJ12BOT hammers out with webmasters should be solid enough that it could be made a standard for other distributed crawlers as well, such as the crawlers using AWS mentioned previously.
You could do PR announcements, invite others to join you in adopting it to stop fake crawlers, etc., it could get some traction ;)
[edited by: incrediBILL at 3:47 pm (utc) on June 25, 2009]
| 3:42 pm on Jun 25, 2009 (gmt 0)|
Ok, I can put this into the user-agent since it will be done only for sites that requested it - a long time ago when it all started we were putting things into the user-agent and webmasters hated it because it distorted stats, but since in this case the webmaster requests it, I suppose it's fine.
One risk here though is that some logs may become public, so the secret key won't be so secret - but I guess most people who fake user-agents won't go that far anyway; if they really want to crawl your specific site they will probably pretend to be a browser or something.
| 3:55 pm on Jun 25, 2009 (gmt 0)|
|As a matter of fact, I would suggest that whatever deal MJ12BOT hammers out with webmasters should be solid enough that it could be made a standard for other distributed crawlers as well, such as the crawlers using AWS mentioned previously. |
Yes, I agree - it's best to do it in a way that is fairly standard rather than MJ12bot-specific. The permissions webmasters give should be bot-specific however, ie: it's up to the webmaster to decide if a particular bot is acceptable enough. Secret keys will probably need to be unique on a per-bot, per-site basis.
Let's start small though with focus on trying to get it to work in the first place.
If this is the solution that will keep you guys happy, then I can give you my commitment to get it implemented. Timewise I've put it in my diary for the end of July, but it will take a few weeks to update all bots - I think it can be operational before the end of summer. Are you ok with that timeline? I'll try to push it out quicker; I've just got other commitments until at least the middle of July. :(
btw, I will be at Pubcon in London on the 4th ;)
| 5:00 pm on Jun 25, 2009 (gmt 0)|
That timeline works for me.
It would probably help if you can test it directly with some of our members in this thread - a lot of people with good expertise - before releasing it into the wild, to make sure it's working well on both Linux and Windows-based servers.
[edited by: incrediBILL at 7:45 pm (utc) on June 25, 2009]
| 5:10 pm on Jun 25, 2009 (gmt 0)|
Ok, it's a deal then!
I am going to send you a message when I have a beta ready for testing - it won't take too long, since I certainly have an interest in creating a big disincentive for fakers to steal our user-agent :(
[edited by: incrediBILL at 9:28 pm (utc) on June 25, 2009]
[edit reason] fixed formatting [/edit]