MJ12bot/v0.5.0

MJ12bot is a prototype web-crawling robot.

pendanticist

3:04 am on Nov 13, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



82.36.78.78 - - [29/Oct/2004:08:58:14 -0700] "HEAD /robots.txt HTTP/1.1" 200 0 "-" "MJ12bot/v0.5.0 (http://www.majestic12.co.uk/projects/dsearch/mj12bot.php) run by PeerID=F43AA089D42A3C7610E3778C4E73A95E MemberID=C9403BB515387FBB631AC512950E6F0E"
82.36.78.78 - - [29/Oct/2004:08:58:15 -0700] "GET /robots.txt HTTP/1.1" 200 1705 "-" "MJ12bot/v0.5.0 (http://www.majestic12.co.uk/projects/dsearch/mj12bot.php) run by PeerID=F43AA089D42A3C7610E3778C4E73A95E MemberID=C9403BB515387FBB631AC512950E6F0E"
82.36.78.78 - - [29/Oct/2004:08:58:15 -0700] "GET / HTTP/1.1" 200 20402 "-" "MJ12bot/v0.5.0 (http://www.majestic12.co.uk/projects/dsearch/mj12bot.php) run by PeerID=F43AA089D42A3C7610E3778C4E73A95E MemberID=C9403BB515387FBB631AC512950E6F0E"

MJ12bot - A Web Article by Janet Systems [janetsystems.co.uk]

Not sure I understand the Peer and Member ID parts...

bull

10:23 pm on Nov 13, 2004 (gmt 0)

10+ Year Member



You could ask our member Lord_Majestic [webmasterworld.com] directly.
Obviously it is his bot, too bad that there is no preview of the new "biggest search engine in the world".

Lord Majestic

11:25 pm on Nov 13, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Eh, you guys are quick; the moment of fame came faster than I expected ;)

Obviously it is his bot, too bad that there is no preview of the new "biggest search engine in the world".

"Rome was not built in one day" ;)

Crawling is the biggest bottleneck that I have, so I need to get it tuned before switching to indexing the data that it collects. PeerID identifies the unique peer (machine) that was running the bot, and MemberID identifies the particular member who was running the bot on one or more peers. I will shortly add the ID of the URL batch that the bot was crawling, so that I can respond easily to any issues reported.
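
For anyone who wants to pull those IDs out of their own logs, here is a minimal sketch. The pattern is only inferred from the three log lines posted at the top of this thread, so treat it as an assumption: the real User-Agent format may differ or change between versions, and `parse_mj12bot_agent` is a hypothetical helper name, not part of the bot.

```python
import re

# Hypothetical pattern inferred from the log lines in this thread;
# the actual User-Agent format may change between bot versions.
UA_PATTERN = re.compile(
    r"MJ12bot/v(?P<version>[\d.]+)\s+\((?P<url>[^)]+)\)"
    r"(?:\s+run by\s+PeerID=(?P<peer_id>[0-9A-F]+)"
    r"\s+MemberID=(?P<member_id>[0-9A-F]+))?"
)


def parse_mj12bot_agent(user_agent):
    """Return version, url, peer_id and member_id as a dict, or None if no match."""
    match = UA_PATTERN.search(user_agent)
    return match.groupdict() if match else None


ua = (
    "MJ12bot/v0.5.0 (http://www.majestic12.co.uk/projects/dsearch/mj12bot.php) "
    "run by PeerID=F43AA089D42A3C7610E3778C4E73A95E "
    "MemberID=C9403BB515387FBB631AC512950E6F0E"
)
info = parse_mj12bot_agent(ua)
print(info["version"], info["peer_id"], info["member_id"])
```

Grouping log hits by `peer_id` would show which machine made a given batch of requests, and `member_id` would tie several machines to one operator.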

I am pleased to see that robots.txt was requested, and I hope the bot obeyed the rules you had (it should!). You may notice that I initially requested robots.txt with a HEAD request. The GET request is executed (and the log shows that it was) only if HEAD reports that robots.txt is actually present. This is a feature of the bot designed to improve performance (and save you bandwidth on 404s). I think it's pretty unique, but feel free to correct me on this one.
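
The HEAD-then-GET sequence described above can be sketched roughly as follows. This is an illustration of the idea only, not MJ12bot's actual code: `http_request` and `fetch_robots_txt` are hypothetical names, and the transport is passed in as a parameter so the logic can be tried without touching the network.

```python
import urllib.error
import urllib.request


def http_request(method, url):
    """Default transport: perform one HTTP request and return (status, body)."""
    req = urllib.request.Request(url, method=method)
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as exceptions; report the status code.
        return err.code, b""


def fetch_robots_txt(base_url, request=http_request):
    """Send HEAD first; issue the GET only if HEAD reports the file exists."""
    url = base_url.rstrip("/") + "/robots.txt"
    status, _ = request("HEAD", url)
    if status != 200:
        return None  # no robots.txt reported: skip the GET entirely
    status, body = request("GET", url)
    return body if status == 200 else None
```

Calling `fetch_robots_txt("http://example.com")` would make one request when the file is missing and two when it exists, so the saving only shows up on sites without a robots.txt.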

MJ12bot supports gzip'ped pages, so if you have compression enabled then it could save you and me a good deal of bandwidth.
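
As a rough illustration of what gzip support involves on the client side (a sketch using Python's standard library, not the bot's own code; `build_headers` and `decode_body` are hypothetical names): the client advertises gzip in its request headers and decompresses the body when the server's Content-Encoding header says it was compressed.

```python
import gzip


def build_headers():
    """Request headers that tell the server gzip-compressed responses are accepted."""
    return {"Accept-Encoding": "gzip"}


def decode_body(raw, content_encoding):
    """Decompress the body if the Content-Encoding header value is gzip."""
    if (content_encoding or "").strip().lower() == "gzip":
        return gzip.decompress(raw)
    return raw


# Simulate a server compressing a repetitive HTML page.
page = b"<html>" + b"hello " * 1000 + b"</html>"
compressed = gzip.compress(page)
print(len(page), "bytes uncompressed,", len(compressed), "bytes on the wire")
```

Repetitive markup like this compresses very well, which is where the bandwidth saving for both sides comes from.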

regards,

Alex

P.S. I would appreciate it if you edited out the last digits of the IP address....

pendanticist

11:47 pm on Nov 13, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



That would explain the link to a thread here at WebmasterWorld. :)

So, ah, what does that ID stuff mean? Anything we should be concerned about?

Not gonna let everyone and their Uncle use this thing like grub are you?

Just curious....

Lord Majestic

11:50 pm on Nov 13, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Not gonna let everyone and their Uncle use this thing like grub are you?

You think fast -- it will be better than grub ;)

IDs are designed to trace errors to specific peers or members. Should they not play by the rules of the house, they will be denied URL batches and thus stop crawling.

regards,

Alex

pendanticist

8:43 pm on Nov 22, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Who makes that determination, and how? And what will it mean to the webmaster whose material has already been scraped?

Lord Majestic

10:15 pm on Nov 22, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Who makes that determination, and how? And what will it mean to the webmaster whose material has already been scraped?

It's no different from how major search engines operate: since the system obeys robots.txt [1] and robots meta tags [2], it's ultimately the webmaster's choice whether pages remain in the index or not.

[1] + [2] almost make up the Three Laws of Robotics; if anyone can come up with the [3]rd, please let me know (sticky or something).
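
For webmasters wondering what checks [1] and [2] look like in practice, here is a minimal sketch using Python's standard library. It is a generic illustration, not how MJ12bot itself is implemented; the helper names, the sample robots.txt rules and the sample page are all made up for the example.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


def allowed_by_robots_txt(robots_txt, user_agent, url):
    """[1] Check a URL against the rules in a robots.txt body."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


class MetaRobotsParser(HTMLParser):
    """[2] Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            self.directives.update(
                d.strip().lower() for d in content.split(",") if d.strip()
            )


def may_index(html):
    """Return False if the page carries a noindex directive."""
    parser = MetaRobotsParser()
    parser.feed(html)
    return "noindex" not in parser.directives


# Made-up sample data for illustration only.
rules = "User-agent: MJ12bot\nDisallow: /private/\n"
page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
print(allowed_by_robots_txt(rules, "MJ12bot", "http://example.com/private/a.html"))
print(may_index(page))
```

Either mechanism is enough on its own: a Disallow rule stops the page being fetched at all, while a noindex tag lets it be fetched but keeps it out of the index.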

pendanticist

10:28 pm on Nov 22, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



So, each visit by the bot will contain all that ID stuff, and thus each user of the bot will leave its very own distinctive IP number?

Lord Majestic

10:20 am on Nov 23, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



So, each visit by the bot will contain all that ID stuff, and thus each user of the bot will leave its very own distinctive IP number?

IPs will be very different -- I think it's safe to say there will be no easily identifiable range of IPs from which the bot comes. This is not to beat cloaking by IP (I'd be flattered if it went that far!), but an inherent feature of the system.

The ID stuff is currently designed to allow easy audit of the system; it may or may not go away in the future -- a lot will depend on feedback from people like yourself.

All of these are implementation details; the principal goal is to follow the best practices set by major search engines, augmented by feedback from people on the ground.

mcneely

1:02 am on Jan 7, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Well then mate

The 3rd law of robotics.

Besides the two you had previously mentioned........

#3....Bring in tea and cakes for everyone........

Lord Majestic

1:11 am on Jan 7, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



#3....Bring in tea and cakes for everyone........

Good law - give us some time and I hope there will be enough cakes, tea and coffee with rum for everybody ;)

To stay on topic -- we have recently gone through a number of version changes, and currently stand at v0.7.5. This will certainly change, and fast, but incrementally, due to bug fixes not necessarily relating directly to pulling pages from websites, since our crawler is a bit more complicated than that.