MJ12bot v1.3.0 Implements Ground Breaking Validation Capability

System

2:47 am on Jun 21, 2009 (gmt 0)

redhat

< split from [webmasterworld.com...] by incredibill - >

[edited by: encyclo at 11:25 am (utc) on Sep. 7, 2009]

incrediBILL

3:35 am on Sep 4, 2009 (gmt 0)

mix of white listing and UA/IP/refer blocking

I do those things as well, security is like an onion, it's installed in layers.

MJ12Bot just added another layer is all!

keyplyr

9:18 am on Sep 4, 2009 (gmt 0)

OK, when I think about the ability (possibility) to get other distro bots to follow this procedure, then it is worth the couple lines of code. Sorry for the cynical mood. Security topics do that to me, especially when the labor always seems to be on my end.

Registered but having a problem getting verification file retrieved. Guess Alex went to bed. Sent email.

Lord Majestic

11:49 pm on Sep 4, 2009 (gmt 0)

keyplyr: sorry was in London all day, we are just moving into new office (with additional employee!), will sort your stuff out first thing tomorrow!

keyplyr

8:51 am on Sep 5, 2009 (gmt 0)

keyplyr: sorry was in London all day, we are just moving into new office (with additional employee!), will sort your stuff out first thing tomorrow!

No need, got the needed info from tonight's logs, verified successfully now. Thanks.

IanCP

9:24 am on Sep 6, 2009 (gmt 0)

@Lord Majestic you have more than piqued my interest. I'm impressed with what I've seen so far; however, I'm puzzled why you can claim some 30+% more backlinks than Google.

No, I haven't downloaded anything yet but will do when I can set time aside.

Thanks anyway.

keyplyr

9:28 am on Sep 6, 2009 (gmt 0)

IanCP, AFAIK Google has never displayed *all* backlinks. The reason has been debated for a long time.

Lord Majestic

9:28 am on Sep 6, 2009 (gmt 0)

I'm puzzled why you can claim some 30+% more backlinks than Google.

Where did you see this claim? I can't post links here really, but if you check our site in Research section you'll see chart of our links DB compared with Google, right now we have 697 bln unique urls vs Google's 1 trillion (1000 bln) announced last year, however later today our number should go up to nearly 900 bln. I don't believe we ever claimed that we have more backlinks than in Google's internal database, however we certainly do have more than Yahoo or anyone else who publicly offers such access, if you have any evidence to the contrary then I'd love to see it! :)

incrediBILL

9:52 am on Sep 6, 2009 (gmt 0)

Let's keep the thread on topic about the MJ12Bot validation and not about the features, functions or backlink counts of MJ12.

Meaning, let's hear more about people that register to validate MJ12Bot and their experiences with this validation process and how they implemented validation server side and whether or not any MJ12Bot fakes were detected.

Thanks.

IanCP

10:22 am on Sep 6, 2009 (gmt 0)

Sorry Bill.

IanCP, AFAIK Google has never displayed *all* backlinks. The reason has been debated for a long time

We learn more each day.

@Lord Majestic, my comments were not to be interpreted as your set up is wrong or whatever.

Simply amazing.

I've never been "befuddled" by statistics over the last 10+ years but all of a sudden someone presents me with figures well above what I had previously known.

In the morning I'll download the file.

For the "uninformed", I don't do nothing after a large and enjoyable "Fathers Day". I wait and do "stuff" the next day.

I have high expectations.

Again, sorry Bill.

incrediBILL

10:57 am on Sep 6, 2009 (gmt 0)

No problem IanCP, let's just get back on track ;)

Lord Majestic

10:58 am on Sep 6, 2009 (gmt 0)

To stay on topic - we plan to put current beta crawler into production early next week, and hopefully within 10-14 days all our crawlers will run new version that supports this validation technique.

[edited by: incrediBILL at 11:57 am (utc) on Sep. 6, 2009]
[edit reason] clean up [/edit]

keyplyr

8:40 am on Sep 7, 2009 (gmt 0)

71.181.32.** MJ12bot/v1.2.5 just got hundreds of 403s because no unique ident in UA string. Guess this is one of the old ones huh?

[edited by: incrediBILL at 9:03 am (utc) on Sep. 7, 2009]
[edit reason] clean up [/edit]

Lord Majestic

6:35 pm on Sep 7, 2009 (gmt 0)

It is the current one keyplyr, as I explained only v1.3.0 (or higher) will have new ident feature - right now we are testing this new version while the current (or "old" if you like) still runs.

We will start update of all crawlers tomorrow, this might take whole week at least.

incrediBILL

7:59 pm on Sep 7, 2009 (gmt 0)

I explained only v1.3.0 (or higher) will have new ident feature

I added the version number into the thread title to help it be more obvious.

keyplyr

8:13 pm on Sep 7, 2009 (gmt 0)

I explained only v1.3.0 (or higher) will have new ident feature

Sorry, when I posted I was having a difficult time finding all the sections of this discussion since they were being moved at the time.

I guess my real question was/is will blocking the above crawl and others w/ older bots affect the index stats?

Lord Majestic

8:26 pm on Sep 7, 2009 (gmt 0)

Blocking of our bot (any version) should not affect index stats since we primarily deal with external backlinks: it would affect our future full-text index (next year we plan to give it a good go again), but good sites can rank on the basis of anchor text anyway. Our verification checks robots.txt as well, so if our bot is blocked then we can't verify site, so you won't be able to configure this ident string.

incrediBILL

2:01 pm on Sep 8, 2009 (gmt 0)

Blocking of our bot (any version) should not affect index stats since we primarily deal with external backlinks:

You're only partially correct here because it affects external index stats for all sites the blocked site links to.

Therefore, any site that allows Google or Yahoo to crawl yet blocks MJ12Bot is limiting your backlink information that might appear in their sources.

Hopefully allowing webmasters to validate your bot will encourage more to allow it to crawl thus avoiding those discrepancies.

Lord Majestic

6:11 pm on Sep 9, 2009 (gmt 0)

Hopefully allowing webmasters to validate your bot will encourage more to allow it to crawl thus avoiding those discrepancies.

I hope so - from our side we want to ensure that we give something back to webmasters (alternative to GWT), it's not yet full text search, but next year we'll renew our efforts once we build up list of URLs worthy of inclusion into that index ;)

Lord Majestic

12:20 pm on Sep 13, 2009 (gmt 0)

Just a quick note - tonight we'll start upgrading all our distributed crawlers, this might take up to 2 weeks as they are run by volunteers and they need to do it manually, however once this period is over we'll turn off all crawlers, so that there will be one version of MJ12bot v1.3.0 (or higher in the future) that supports this new feature.

keyplyr

1:05 am on Oct 3, 2009 (gmt 0)

71.10.72.25 - - [01/Oct/2009:20:12:13 -0700] "GET www.example.com HTTP/1.1" 403 936 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.2.5; [majestic12.co.uk...]

Is this an authentic MJ12bot or a poser? Would have thought they'd all be updated and the older versions turned off by now.

jdMorgan

1:54 am on Oct 3, 2009 (gmt 0)

I haven't seen a /v1.3.0 MJ12bot using my validation ident string yet. I've had several MJ12bot/v1.3.0 'bot visits in the past couple of weeks, but none with the ident string in either the user-agent string or in the Crawler-Ident header (I configured both for initial testing). So, I don't know if these were spoofers, or if the roll-out has just been delayed.

Jim

Lord Majestic

12:52 pm on Oct 3, 2009 (gmt 0)

Sorry for delay guys - we had to test new client for longer, one bug was found - new crawler version will be v1.3.1, I just ran tests with idents sent to all participating sites, please check your logs...

I am going to release new crawler tonight - sorry again for this delay, we should have old versions (v1.2.x) turned off within 10 days from now - I'll post here to confirm that happened.

jdMorgan

1:53 pm on Oct 3, 2009 (gmt 0)

LordM,

Oops... I *did* get a request, but it received a 403 response...

Mozilla/5.0 (compatible; MJ12bot/v1.3.1; http://www.majestic12.co.uk/bot.php?+);(ident=myCleverIdentString);

Note that "comments" in the user-agent string should end with the ")" right parenthesis, and not with a semi-colon...

That is, the string should be:

Mozilla/5.0 (compatible; MJ12bot/v1.3.1; http://www.majestic12.co.uk/bot.php?+);(ident=myCleverIdentString)
with no trailing semi-colon.

For now, I'll have to add code to make an exception to the filters...

Jim

Lord Majestic

3:12 pm on Oct 3, 2009 (gmt 0)

Why would trailing ; be an issue?

I thought you'd just test for existence of myCleverIdentString in user-agent (along with testing it is MJ12bot version 1.3. ) as a sign that it is validated?

I can remove trailing ; there of course for release, but it would be better not to depend on it.

Anyhow, it seems that it worked and correct ident was sent :)
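
For readers following along, the server-side test described here could be sketched roughly as below. This is only an illustration under stated assumptions: the function name `is_validated_mj12bot`, the plain-dict `headers` argument with exact header-name keys, and the idea that the `Crawler-Ident` header carries the bare ident value are all hypothetical. Only the `ident=` form in the user-agent and the existence of a `Crawler-Ident` header come from the thread.

```python
import re

# Patterns are assumptions based on the user-agent strings quoted in this thread.
MJ12_VERSION = re.compile(r"MJ12bot/v(\d+)\.(\d+)\.(\d+)")
UA_IDENT = re.compile(r"\bident=([^\s;()]+)")


def is_validated_mj12bot(headers, ident):
    """Return True if the request claims MJ12bot >= v1.3.0 and carries our ident.

    `headers` is a plain dict keyed by exact header name (an assumption;
    a real server would normally look headers up case-insensitively).
    """
    ua = headers.get("User-Agent", "")
    version_match = MJ12_VERSION.search(ua)
    if version_match is None:
        return False
    version = tuple(int(part) for part in version_match.groups())
    if version < (1, 3, 0):
        return False  # per the thread, only v1.3.0 or higher sends an ident

    # Accept the ident from either place; compare whole values, not substrings.
    ua_match = UA_IDENT.search(ua)
    in_ua = ua_match is not None and ua_match.group(1) == ident
    in_header = headers.get("Crawler-Ident", "") == ident
    return in_ua or in_header
```

Matching the whole ident value rather than a bare substring means a request carrying `ident=myCleverIdentStringX` is not accepted by accident, and the check does not depend on whether a trailing semicolon is present.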

jdMorgan

4:07 pm on Oct 3, 2009 (gmt 0)

Yes, the correct Crawler-Ident value was sent and validated in both the UA string and the Crawler-Ident header.

The trailing ";" is an issue because it is a syntax error. As in all other HTTP headers, it means "more follows," and is always followed by another token or comment, which usually starts with a space (some exceptions occur in the HTTP Accept and proxy-related headers).

Understand that for some of us who run sites that are 'irresistibly attractive' to scrapers and malicious 'bots, these unwelcome accesses comprise a significant portion of the requests to the server; On this site, which is actually quite small but extremely well-ranked and popular (in its small niche) due to its unique and valuable content, bad-bots make up about 30% of *all* requests. The scrapers just love to copy these pages and plaster ads all over them, and then try to compete with the 'real' site in search.

So from my chair, the Web looks quite unfriendly, and in order to reduce the number of sites I have to chase down and issue DMCA claims about, and in order to reduce bandwidth and time wasted serving unwelcome requests, I filter all requests quite thoroughly; For example, if you try to access the site using a UA of MSIE7 on a Win98 operating system, you get banned because that OS doesn't support MSIE7. Likewise if there is a single little problem with the composition, syntax, or consistency of any HTTP header(s)... 403.
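
To make that kind of filtering concrete, here is a minimal sketch of a user-agent sanity check. It is not the actual filter set used on that site; the function name `header_problems` and the three specific rules are assumptions chosen to mirror the examples mentioned in this thread (a trailing semicolon, a semicolon with no following space, and MSIE 7 claimed on Windows 98).

```python
import re


def header_problems(ua):
    """Return a list of reasons a User-Agent string looks malformed or inconsistent."""
    problems = []
    # A semicolon means "more follows", so it should never be the last character.
    if ua.rstrip().endswith(";"):
        problems.append("trailing semicolon")
    # Tokens are conventionally separated by semicolon-space, so ";(" or ";x" is suspect.
    if re.search(r";(?=\S)", ua):
        problems.append("semicolon not followed by a space")
    # Consistency check: Windows 98 never supported MSIE 7.
    if "MSIE 7" in ua and "Windows 98" in ua:
        problems.append("MSIE 7 claimed on Windows 98")
    return problems
```

A server using something like this would answer 403 whenever the list is non-empty, which is why a stray `;` at the end of an otherwise genuine crawler string gets the same treatment as a scraper.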

I don't do this out of over-protectiveness or paranoia, I do it because it's a not-for-profit site that I don't get paid to maintain, and I don't have the time to be constantly chasing down abusive and malicious requests and issuing DMCA claims -- there are more important things in life than wasting time doing that. So, I have used it as a 'test platform,' and any filters I develop for it get deployed to all the other sites I work on. This saves me an awful lot of time, and saves clients time and money as well since they tend to spend less money and time on legal interventions for copyright issues.

Anyway, yes, there are 'rules' about composing HTTP header strings of all kinds, and it's a good idea to follow them. Now if I might just talk you into removing the unnecessary "?+" at the end of your robot info page URL in MJ12's UA string and considering merging the two comment fields into one that would be nice... :)

Mozilla/5.0 (compatible; MJ12bot/v1.3.1; http://www.majestic12.co.uk/bot.php ident=myCleverIdentString)

I don't mean to be picky, but this saves having to add exceptions for MJ12 to several filters, and possibly on many Web sites other than my own... And it conforms to 'standards.'

Thanks very much for doing all this work to add the crawler verification; Your next MJ12/v1.3.1 requests will get 200-OK responses instead of 403s -- Sorry 'bout that! ;)

Jim

Lord Majestic

4:45 pm on Oct 3, 2009 (gmt 0)

The reason we put ?+ was to increase chances of people getting to our bot's page: from our logs some years ago it was clear that some log analysis software was showing link with appended ) and it was resulting in 404 Not Found error, so we started using this approach as we wanted to ensure people get to our bot's page if they want to.

So, I am inclined to keep ?+ bit, but I can put ident right after it if it helps, ie:

Mozilla/5.0 (compatible; MJ12bot/v1.3.1; [majestic12.co.uk...]

^^^^

From the above link you can see why we use ?+ - if it was not for it, then the link would be broken when pasted to this forum!
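
The effect of the ?+ suffix can be illustrated with a small sketch. It assumes a log tool that glues the closing ")" of the user-agent comment onto the URL, as described above; the helper name `path_after_sloppy_extraction` is made up for this example.

```python
from urllib.parse import urlsplit


def path_after_sloppy_extraction(url, junk=")"):
    """Simulate a log tool that appends trailing UA punctuation to the URL,
    then return the path the web server would actually be asked for."""
    return urlsplit(url + junk).path


# Without "?+", the stray ")" becomes part of the path, so the page is not found.
print(path_after_sloppy_extraction("http://www.majestic12.co.uk/bot.php"))    # /bot.php)

# With "?+", the stray ")" lands in the query string and the path stays intact.
print(path_after_sloppy_extraction("http://www.majestic12.co.uk/bot.php?+"))  # /bot.php
```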

jdMorgan

6:25 pm on Oct 3, 2009 (gmt 0)

The problem with that is that it makes the "ident" look like a query string to be attached to the MJ12 URL. That's a slight additional risk to the Webmaster, since he may then 'transmit' his crawler ident in the request. I'm not sure what the concern about 'breaking the link' when posting is... That happens all the time, and we just deal with it. If you use a space, then I doubt the stats and log analysis programs will get confused -- this is the method that 'everybody else' uses... space between token 'words,' semicolon and space between complete 'tokens' (but not after the last token), parentheses around complete "comments."

Mozilla/5.0 (compatible; MJ12bot/v1.3.1; [majestic12.co.uk...] ident=myCleverIdentString)
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30618)
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.3) Gecko/20090824 Firefox/3.5.3
So
Mozilla/5.0 (compatible; MJ12bot/v1.3.1; [majestic12.co.uk...] ident=myCleverIdentString)
or
Mozilla/5.0 (compatible; MJ12bot/v1.3.1; [majestic12.co.uk...] ident=myCleverIdentString)
or
Mozilla/5.0 (compatible; MJ12bot/v1.3.1; [majestic12.co.uk...] ident=myCleverIdentString
or
Mozilla/5.0 (compatible; MJ12bot/v1.3.1; [majestic12.co.uk...] (ident=myCleverIdentString)
would all be "valid."

It's your choice, though... I'm just happy to see the Crawler-Ident showing up with /v1.3.1 :)

Jim

Pfui

7:04 pm on Oct 3, 2009 (gmt 0)

Just casting my vote for any of Jim's suggested strings, please. The snafus he detailed (thank you!) -- like the 'no space after a semi-colon' bad bot tell: );( -- are not insignificant. Neither is recoding one, let alone multiple, root and directory level .htaccess files to allow one bot. More work for the server, too.

jdMorgan

8:22 pm on Oct 3, 2009 (gmt 0)

To avoid changing your UA too much from your historical precedents, you could also consider inserting the ident after MJ12 and before the URL:

Mozilla/5.0 (compatible; MJ12bot/v1.3.1; id:myCleverIdentString; http://www.majestic12.co.uk/bot.php?+)

Used ":" instead of "=" since that's also a convention, and "id" since it's shorter.

Putting things as simply as possible, all "words" should be separated by spaces, and when a new token is inserted, it should be preceded by semi-colon-space and followed by only whatever it was inserted ahead of (such as the next token or a closing parenthesis). This implies also that the 'insertion point' is always just before an existing semicolon or closing parenthesis, or lacking those, at the end of the string.

So the answer to the question above is that a semicolon-space is a "new token-leading character-sequence" and not a "trailing character-sequence."

Jim
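
The insertion rule above can be sketched in a few lines. This is only an illustration: the function name `insert_token` and its `after` parameter are hypothetical, and the sketch assumes the anchor token passed in `after` ends right before an existing semicolon or closing parenthesis.

```python
def insert_token(ua, token, after=None):
    """Insert `token` into a user-agent string, led by '; ' and with no
    trailing separator of its own.

    If `after` is given, the token goes right behind that existing token.
    Otherwise it goes just before the last ')' or, lacking one, at the end.
    """
    if after is not None:
        start = ua.find(after)
        if start == -1:
            raise ValueError("anchor token not found: " + after)
        point = start + len(after)
    else:
        close = ua.rfind(")")
        point = close if close != -1 else len(ua)
    return ua[:point] + "; " + token + ua[point:]


base = "Mozilla/5.0 (compatible; MJ12bot/v1.3.1; http://www.majestic12.co.uk/bot.php?+)"
print(insert_token(base, "id:myCleverIdentString", after="MJ12bot/v1.3.1"))
```

Because the separator always leads the new token, the result never ends in a stray semicolon no matter where the token is placed.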

Lord Majestic

8:25 pm on Oct 3, 2009 (gmt 0)

We can do this:

Mozilla/5.0 (compatible; MJ12bot/v1.3.1; [majestic12.co.uk...] ident=myCleverIdentString)

?+ bit helps us solve problem of our bot's page being accessible within one click from bad log analysis software, people jump to conclusions too quickly these days so we had to do it in order to maximise chances of them getting to our bot's page.

Would you be happy with that? I'll delay release until tomorrow to make these changes to our user-agent.
