
Majestic search engine


Dave_A

10:37 pm on Jan 5, 2005 (gmt 0)

10+ Year Member



They have set up a forum in the UK, which can be found at www.majestic12.co.uk, and they appear to be building a search engine the same way SETI@home looks for alien life forms: by using heaps of people's computers to run a robot that crawls web sites. From some of the comments within the forum, they don't have an operational search engine at the moment.
They appear to be drawing on a list of web sites from DMOZ, so it may well be slightly off target.
If they are crawling web sites and don't have an operational search engine I feel that they are just eating at people's bandwidth.
One should be aware that any web spider (mine included) will eat at bandwidth and should obey any robots.txt files found. I am not sure they are aware of the amount of threads they are using, and they don't appear to know how many files they may be pulling across anyone's web servers.
Can a search engine that is so far spread out control its spidering? And what will happen to the data that they index?
As one of New Zealand's larger search engines, I am well aware of the damage that a web crawler can do to bandwidth. My Linknzbot obeys robots.txt files and only opens one thread at a time into a server; I want to make the web site searchable, not pull it down and suck the guts out of it.
It may well be worth everyone watching what this one does next.
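The "one thread at a time into a server" policy described above can be sketched as a per-host lock. This is a minimal Python illustration of the idea, not Linknzbot's actual code; the `fetch` callable is a stand-in for whatever HTTP getter a crawler uses:

```python
import threading
from urllib.parse import urlsplit

# One lock per host: a crawler thread must hold the host's lock while
# fetching, so at most one connection is ever open to any given server.
_host_locks: dict[str, threading.Lock] = {}
_registry_lock = threading.Lock()

def host_lock(url: str) -> threading.Lock:
    host = urlsplit(url).hostname or ""
    with _registry_lock:
        return _host_locks.setdefault(host, threading.Lock())

def polite_fetch(url: str, fetch) -> bytes:
    # 'fetch' is a hypothetical HTTP getter supplied by the caller.
    with host_lock(url):
        return fetch(url)
```

Two URLs on the same host share one lock, so concurrent crawler threads serialise their requests to that server while still fetching different hosts in parallel.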

All the best
Guys and Girls.
Dave Andrews

volatilegx

2:20 pm on Jan 6, 2005 (gmt 0)

Lord Majestic

12:32 am on Jan 7, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If they are crawling web sites and don't have an operational search engine I feel that they are just eating at people's bandwidth.

Err, Dave, cut us some slack please -- even Microsoft did not have a search engine operational until they had some data to base it on. Since bandwidth is the biggest bottleneck, it's a simple matter of priorities that dictates having the distributed crawler up and running before turning to the search engine part of the puzzle. We will release a public version of the search engine as soon as we get it up and running... which will happen much sooner than you might suspect!

We fully support robots.txt, and that includes Crawl-Delay, which should prevent the "sucking bandwidth" that you referred to. We don't do that (unless it's a bug and you have evidence to show that it is a bug), so please stop implying that we are doing something we clearly don't.
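For what it's worth, robots.txt and Crawl-Delay checking can be done with nothing more than the Python standard library. A small sketch -- the rules are an invented example and the bot name is illustrative, not Majestic's real robots.txt handling:

```python
from urllib import robotparser

# Parse an example robots.txt with a Crawl-Delay directive.
rp = robotparser.RobotFileParser()
rp.parse("""
User-agent: *
Crawl-delay: 10
Disallow: /private/
""".splitlines())

# A compliant bot checks permission and honours the delay between requests.
assert rp.can_fetch("MJ12bot", "http://example.com/index.html")
assert not rp.can_fetch("MJ12bot", "http://example.com/private/x.html")
assert rp.crawl_delay("MJ12bot") == 10  # seconds to wait between fetches
```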

Can a search engine that is so far spread out be able to control its spidering?

Yes it can -- that's why it took 3 months of full-time effort rather than a few weeks. Distributed crawlers do not follow links on their own -- they are issued links from a central server, and this central server groups (and dedupes) sites together into the same "bucket", so the number of distributed crawlers that will try to crawl any given site will be... about 1. Can there be a bug? Always possible, but the design principles that were adopted favoured "good netizen behaviour" over "simplicity of programming".
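The "bucket" idea above can be sketched as a stable hash of the site name, so the central server always hands a given site to the same crawler. This is an illustration of the principle, not Majestic's actual code:

```python
import hashlib

def bucket_for(host: str, num_crawlers: int) -> int:
    # Stable hash of the site name: every URL from the same site always
    # lands in the same bucket, and hence goes to the same crawler.
    digest = hashlib.sha1(host.lower().encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_crawlers

# All URLs for one site map to one crawler, regardless of case.
assert bucket_for("example.com", 8) == bucket_for("EXAMPLE.COM", 8)
```

Because the assignment is deterministic, no coordination is needed at crawl time: the central server computes the bucket once and only that crawler ever sees the site's URLs.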

I am not sure if they are aware of the amount of threads they are using and they don't appear to know what number of files they may be pulling across anyone's web servers.

If you are not sure, then you could have asked that question in our forum. Since you asked it here publicly I will have to answer it here too -- we ARE AWARE of the amount of threads we are using, and we certainly are aware of the number of files we are pulling from people's web servers. Hell, how would we not be, when we have public individual stats a la SETI@home?

It is a design feature of the crawler to use 1 (one) connection to any given site (compare that to the maximum of two mandated by HTTP/1.1), and we use in-crawler "pipelining" of sites, so that forum.example.com is treated in the same pipeline as www.example.com -- this is to avoid having more than one connection to what is likely to be hosted on the same box. A PITA to program, but it helps overall crawler performance and keeps the bot from overloading web sites.
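The subdomain grouping can be illustrated with a naive "last two DNS labels" rule. A real crawler needs the Public Suffix List to handle names like .co.uk correctly, so this is a sketch of the idea rather than the actual pipeline code:

```python
from urllib.parse import urlsplit

def site_key(url: str) -> str:
    # Naive grouping: keep the last two DNS labels, so that
    # forum.example.com and www.example.com share one pipeline.
    # (Real crawlers need the Public Suffix List for .co.uk etc.)
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    return ".".join(labels[-2:]) if len(labels) >= 2 else host

# Both subdomains collapse to the same pipeline key.
assert site_key("http://forum.example.com/t/1") == site_key("http://www.example.com/")
```

Feeding all URLs with the same key through one pipeline guarantees a single connection to what is probably the same physical box.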

They appear to be drawing on a list of web sites from DMOZ so it may well be slightly off target.

DMOZ had nice sweet links to sites that were actually up (unlike the junk from DNS zone files we are going through now) -- Christ, I loved the smell of those links in the morning... ;) ... sadly, we finished with those 2 months ago :(

I don't know what you meant by "slightly off target", but AFAIK links from DMOZ are pretty much on target for any decent world wide web search engine.

As one of New Zealand's larger search engines

A feature of our distributed crawler is the ability to ask for preferred domains by suffix, such as .NZ or .UK. I hope there will be plenty of your compatriots who take advantage of it -- you are certainly most welcome to do so ;)

It may well be worth everyone watching what this one does next?

Might be more worthwhile joining rather than just watching....

regards

AlexC

[edited by: Lord_Majestic at 12:59 am (utc) on Jan. 7, 2005]

mcneely

12:59 am on Jan 7, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Majestic slid through many of our sites without incident about 3 weeks ago.

Appeared to be compliant with robots.txt, almost to the point of wiping its feet before coming through the door.

Didn't stay for dinner though, collected what was needed and showed itself to the door.

Wish they all did that.

Dave_A

8:33 am on Jan 7, 2005 (gmt 0)

10+ Year Member



Hi everyone,
maybe I should extend a warm hand towards the guys who are setting up the Majestic search engine.
I feel that one thing they should do is inform the www.robotstxt.org web site of the robot's details, user agent name and signature. They may have already done this, but registration of web spiders and bots seems to take an age.
Webmasters should be treated like women: it's always best to be polite and let them at least know your name before you go rooting around their undergrowth (hosts), sly grin..

Lord Majestic

12:42 pm on Jan 7, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I feel that one thing they should do is inform the www.robotstxt.org web site of the robot's details, user agent name and signature. They may have already done this, but registration of web spiders and bots seems to take an age.

Already informed... three times.... months ago! :(

IMO they are either too busy or something else is going on. Either way, every request includes a URL to a site that explains what the robot is doing.
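Including an explanatory URL in every request just means putting it in the User-Agent header. For illustration, a self-identifying request in Python -- the bot name and info URL here are made up, not Majestic's:

```python
import urllib.request

# An identifying User-Agent: bot name, version, and a URL webmasters
# can visit to learn what the robot does (both are hypothetical here).
UA = "ExampleBot/0.1 (+http://bot.example.com/about.html)"

req = urllib.request.Request("http://example.com/",
                             headers={"User-Agent": UA})
# urllib.request.urlopen(req) would send the header; not called here.
assert req.get_header("User-agent") == UA
```

The "+URL" convention inside the comment is what many crawlers use so a webmaster seeing the bot in their logs can find out whose it is.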

thanks

regards

alexc