[edited by: encyclo at 11:25 am (utc) on Sep. 7, 2009]
Registered but having a problem getting verification file retrieved. Guess Alex went to bed. Sent email.
I'm puzzled why you can claim some 30+% more backlinks than Google.
Where did you see this claim? I can't really post links here, but if you check the Research section of our site you'll see a chart of our links DB compared with Google's: right now we have 697 bln unique URLs vs Google's 1 trillion (1000 bln) announced last year, though later today our number should go up to nearly 900 bln. I don't believe we ever claimed to have more backlinks than Google's internal database; however, we certainly do have more than Yahoo or anyone else who publicly offers such access. If you have any evidence to the contrary then I'd love to see it! :)
Meaning, let's hear more about people that register to validate MJ12Bot and their experiences with this validation process and how they implemented validation server side and whether or not any MJ12Bot fakes were detected.
Thanks.
IanCP, AFAIK Google has never displayed *all* backlinks. The reason has been debated for a long time.
We learn more each day.
@Lord Majestic, my comments were not to be interpreted as your set up is wrong or whatever.
Simply amazing.
I've never been "befuddled" by statistics over the last 10+ years, but all of a sudden someone presents me with figures well above what I had previously known.
In the morning I'll download the file.
For the "uninformed", I don't do nothing after a large and enjoyable "Fathers Day". I wait and do "stuff" the next day.
I have high expectations.
Again, sorry Bill.
I explained only v1.3.0 (or higher) will have new ident feature
I guess my real question was/is will blocking the above crawl and others w/ older bots affect the index stats?
Blocking of our bot (any version) should not affect index stats since we primarily deal with external backlinks:
You're only partially correct here, because it affects the external index stats for all the sites the blocked site links to.
Therefore, any site allowing Google or Yahoo to crawl yet blocks MJ12Bot is limiting your backlink information that might appear in their sources.
Hopefully allowing webmasters to validate your bot will encourage more to allow it to crawl thus avoiding those discrepancies.
Hopefully allowing webmasters to validate your bot will encourage more to allow it to crawl thus avoiding those discrepancies.
I hope so - from our side we want to ensure that we give something back to webmasters (alternative to GWT), it's not yet full text search, but next year we'll renew our efforts once we build up list of URLs worthy of inclusion into that index ;)
Is this an authentic MJ12bot or a poser? Would have thought they'd all be updated and the older versions turned off by now.
Jim
I am going to release new crawler tonight - sorry again for this delay, we should have old versions (v1.2.x) turned off within 10 days from now - I'll post here to confirm that happened.
Oops... I *did* get a request, but it received a 403 response...
Mozilla/5.0 (compatible; MJ12bot/v1.3.1; http://www.majestic12.co.uk/bot.php?+);(ident=myCleverIdentString);
Note that "comments" in the user-agent string should end with the ")" right parenthesis, and not with a semi-colon...
That is, the string should be:
Mozilla/5.0 (compatible; MJ12bot/v1.3.1; http://www.majestic12.co.uk/bot.php?+);(ident=myCleverIdentString)
with no trailing semi-colon.
For now, I'll have to add code to make an exception to the filters...
Jim
I thought you'd just test for the existence of myCleverIdentString in the user-agent (alongside testing that it is MJ12bot version 1.3) as a sign that it is validated?
I can remove trailing ; there of course for release, but it would be better not to depend on it.
Anyhow, it seems that it worked and correct ident was sent :)
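For anyone wanting to try the check described above, here is a minimal sketch in Python. It assumes the ident is simply a substring of the user-agent and that the version appears as "MJ12bot/v<major>.<minor>", as in the strings quoted in this thread; the function name and the EXPECTED_IDENT constant are made up for illustration and are not part of anything Majestic publishes.

```python
import re

# Hypothetical per-site secret; the thread's placeholder value is reused here.
EXPECTED_IDENT = "myCleverIdentString"

# Matches "MJ12bot/v<major>.<minor>" and captures the two version numbers.
_VERSION_RE = re.compile(r"MJ12bot/v(\d+)\.(\d+)")


def is_validated_mj12bot(user_agent, ident=EXPECTED_IDENT):
    """Return True if the UA claims MJ12bot v1.3 or later and carries our ident."""
    match = _VERSION_RE.search(user_agent)
    if match is None:
        return False  # not claiming to be MJ12bot at all
    version = (int(match.group(1)), int(match.group(2)))
    if version < (1, 3):
        return False  # older crawlers never send the ident
    # Plain substring test, so it does not depend on the punctuation around the ident.
    return ident in user_agent
```

Because this is only a substring test, it keeps working whether the ident ends up as "(ident=...)", "ident=..." or "id:...", at the cost of being less strict about where in the string it appears.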
The trailing ";" is an issue because it is a syntax error. As in all other HTTP headers, it means "more follows," and is always followed by another token or comment, which usually starts with a space (some exceptions occur in the HTTP Accept and proxy-related headers).
Understand that for some of us who run sites that are 'irresistibly attractive' to scrapers and malicious 'bots, these unwelcome accesses comprise a significant portion of the requests to the server; On this site, which is actually quite small but extremely well-ranked and popular (in its small niche) due to its unique and valuable content, bad-bots make up about 30% of *all* requests. The scrapers just love to copy these pages and plaster ads all over them, and then try to compete with the 'real' site in search.
So from my chair, the Web looks quite unfriendly, and in order to reduce the number of sites I have to chase down and issue DMCA claims about, and in order to reduce bandwidth and time wasted serving unwelcome requests, I filter all requests quite thoroughly; For example, if you try to access the site using a UA of MSIE7 on a Win98 operating system, you get banned because that OS doesn't support MSIE7. Likewise if there is a single little problem with the composition, syntax, or consistency of any HTTP header(s)... 403.
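As a rough illustration of the two checks just mentioned (not the actual filter rules used on the site), a sketch in Python might look like this. The function names are hypothetical, and the "MSIE 7" and "Windows 98" substrings are assumptions about how those browsers and systems identify themselves.

```python
from typing import List


def header_problems(user_agent: str) -> List[str]:
    """Collect reasons to reject a User-Agent string; an empty list means it passed."""
    problems = []
    # A trailing ";" promises another token that never arrives.
    if user_agent.rstrip().endswith(";"):
        problems.append("trailing semicolon")
    # MSIE 7 was never available for Windows 98, so the combination is inconsistent.
    if "MSIE 7" in user_agent and "Windows 98" in user_agent:
        problems.append("MSIE 7 on Windows 98")
    return problems


def status_for(user_agent: str) -> int:
    """Map the check onto an HTTP status: 403 on any problem, otherwise 200."""
    return 403 if header_problems(user_agent) else 200
```

A real filter would of course cover many more headers and combinations; this only shows the shape of the idea.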
I don't do this out of over-protectiveness or paranoia, I do it because it's a not-for-profit site that I don't get paid to maintain, and I don't have the time to be constantly chasing down abusive and malicious requests and issuing DMCA claims -- there are more important things in life than wasting time doing that. So, I have used it as a 'test platform,' and any filters I develop for it get deployed to all the other sites I work on. This saves me an awful lot of time, and saves clients time and money as well since they tend to spend less money and time on legal interventions for copyright issues.
Anyway, yes, there are 'rules' about composing HTTP header strings of all kinds, and it's a good idea to follow them. Now if I might just talk you into removing the unnecessary "?+" at the end of your robot info page URL in MJ12's UA string and considering merging the two comment fields into one that would be nice... :)
Mozilla/5.0 (compatible; MJ12bot/v1.3.1; http://www.majestic12.co.uk/bot.php ident=myCleverIdentString)
I don't mean to be picky, but this saves having to add exceptions for MJ12 to several filters, and possibly on many Web sites other than my own... And it conforms to 'standards.'
Thanks very much for doing all this work to add the crawler verification; Your next MJ12/v1.3.1 requests will get 200-OK responses instead of 403s -- Sorry 'bout that! ;)
Jim
So, I am inclined to keep ?+ bit, but I can put ident right after it if it helps, ie:
Mozilla/5.0 (compatible; MJ12bot/v1.3.1; [majestic12.co.uk...]
^^^^
From the above link you can see why we use ?+ - if it was not for it, then the link would be broken when pasted to this forum!
Mozilla/5.0 (compatible; MJ12bot/v1.3.1; [majestic12.co.uk...] ident=myCleverIdentString)
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30618)
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.3) Gecko/20090824 Firefox/3.5.3
So
Mozilla/5.0 (compatible; MJ12bot/v1.3.1; [majestic12.co.uk...] ident=myCleverIdentString)
or
Mozilla/5.0 (compatible; MJ12bot/v1.3.1; [majestic12.co.uk...] ident=myCleverIdentString)
or
Mozilla/5.0 (compatible; MJ12bot/v1.3.1; [majestic12.co.uk...] ident=myCleverIdentString
or
Mozilla/5.0 (compatible; MJ12bot/v1.3.1; [majestic12.co.uk...] (ident=myCleverIdentString)
would all be "valid."
It's your choice, though... I'm just happy to see the Crawler-Ident showing up with /v1.3.1 :)
Jim
Mozilla/5.0 (compatible; MJ12bot/v1.3.1; id:myCleverIdentString; http://www.majestic12.co.uk/bot.php?+)
Used ":" instead of "=" since that's also a convention, and "id" since it's shorter.
Putting things as simply as possible, all "words" should be separated by spaces, and when a new token is inserted, it should be preceded by semi-colon-space and followed by only whatever it was inserted ahead of (such as the next token or a closing parenthesis). This implies also that the 'insertion point' is always just before an existing semicolon or closing parenthesis, or lacking those, at the end of the string.
So the answer to the question above is that a semicolon-space is a "new token-leading character-sequence" and not a "trailing character-sequence."
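The insertion rule above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: insert_token and its "after" parameter are hypothetical names, and the string handling ignores edge cases such as nested comments.

```python
def insert_token(user_agent, token, after=None):
    """Insert token into a UA string, led by '; ' and followed by whatever was already there.

    With after=None the token goes just before the last ')' or, if there is
    none, at the end of the string. Otherwise it goes right behind the first
    occurrence of `after`, i.e. just ahead of that token's own ';' or ')'.
    """
    if after is None:
        pos = user_agent.rfind(")")
        if pos == -1:
            pos = len(user_agent)  # no comment to extend, so append
    else:
        start = user_agent.find(after)
        if start == -1:
            raise ValueError("token to insert after was not found")
        pos = start + len(after)
    return user_agent[:pos] + "; " + token + user_agent[pos:]
```

Calling insert_token("Mozilla/5.0 (compatible; MJ12bot/v1.3.1; http://www.majestic12.co.uk/bot.php?+)", "id:myCleverIdentString", after="MJ12bot/v1.3.1") gives the string suggested above, with the id ahead of the URL and no trailing semicolon.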
Jim
Mozilla/5.0 (compatible; MJ12bot/v1.3.1; [majestic12.co.uk...] ident=myCleverIdentString)
The ?+ bit helps us solve the problem of our bot's page being accessible within one click from bad log analysis software. People jump to conclusions too quickly these days, so we had to do it in order to maximise the chances of them getting to our bot's page.
Would you be happy with that? I'll delay release until tomorrow to make these changes to our user-agent.