How many servers do we need to crawl the whole web within ONE week?

crawl web

         

clement

7:13 am on Aug 7, 2006 (gmt 0)

10+ Year Member



Given the nearly 25 billion pages indexed by a typical popular search engine,
I just wonder:
how many servers (say, low-end commodity dual Xeon 3 GHz servers with 2 GB of memory)
would be needed to crawl the whole web's contents within ONE week?

best regards
Clement

jdMorgan

5:10 pm on Aug 7, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> to crawl the whole web contents within ONE week?

The limiting factor is not servers, it's bandwidth. You'd need a very fat pipe to GET all those pages, and a very fat wallet to pay for it... about $1.5 Million U.S. [webmasterworld.com] -- Oh, and a big disk farm, too.

Jim

jmccormac

9:46 pm on Aug 8, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Designing and building a large-scale search engine is a very complex task, even for people who know what they are doing. Crawling the entire web in a week would take a hell of a lot of bandwidth and massive resources (acres of PCs). Start small: try reading about simple search engines such as MnoGosearch and Nutch first. Then start reading up on the metrics of the web (as in how many active websites there are and the different gTLDs and ccTLDs). There is a lot more to running a top-level search engine than it first appears.

Regards...jmcc

clement

1:24 am on Aug 12, 2006 (gmt 0)

10+ Year Member



Thanks. Let's assume you can use the most scalable, distributed, and efficient architecture; it seems you would need lots of machines, if $ is not an issue :-)
So what is the minimum number of these commodity servers (as in my first message)
needed to finish crawling?

GaryK

1:47 am on Aug 12, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I would suggest you ask these questions in a more appropriate forum as you're more likely to get prompt and knowledgeable replies.

I'd suggest trying the Website Technology Issues or the Webmaster Hardware forums here on WW.

:)

jmccormac

2:10 am on Aug 12, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



> I would suggest you ask these questions in a more appropriate forum as you're more likely to get prompt and knowledgeable replies.

The Alternative Search Engines forum would be the best place as some Tier 2 and Tier 3 search engine operators hang out there (sometimes) and engage in long threaded conversations like this: [webmasterworld.com...]

Regards...jmcc

jmccormac

2:25 am on Aug 12, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



> Thanks. Let's assume you can use the most scalable, distributed, and efficient architecture; it seems you would need lots of machines, if $ is not an issue :-)

A near-infinite amount of money or just a googol? :)

> So what is the minimum number of these commodity servers (as in my first message)
> needed to finish crawling?

Right so basing it on some simple social science type numbers (value free numbers):
25 billion pages = 25 × 10^9 pages (2.5 × 10^10)
each week has 604800 seconds (7*24*60*60)

At one page per second you would require approximately 41336 weeks.
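
As a quick check of that arithmetic, here is a minimal Python sketch. It assumes only the two figures quoted above (25 billion pages, 604800 seconds per week); the function names are made up for illustration.

```python
# Figures taken from the post above.
PAGES = 25 * 10**9                  # 25 billion pages
SECONDS_PER_WEEK = 7 * 24 * 60 * 60  # 604800

def weeks_to_crawl(pages, pages_per_second):
    """Weeks needed to fetch `pages` at a constant fetch rate."""
    return pages / (pages_per_second * SECONDS_PER_WEEK)

def rate_for_one_week(pages):
    """Pages per second needed to fetch `pages` in exactly one week."""
    return pages / SECONDS_PER_WEEK

print(round(weeks_to_crawl(PAGES, 1)))   # 41336 weeks at one page per second
print(round(rate_for_one_week(PAGES)))   # 41336 pages per second for one week
```

The two results are the same number by construction: 41336 weeks at one page per second is equivalent to needing about 41336 pages per second to finish in one week.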

The problem is that it is not a straight linear equation because:
a: most of the web is discovered by crawling.
b: a lot of the web is dynamic.
c: you have to build the index organically over a period greater than a week.
d: in addition to the search aspect, you have to have a processing aspect and a DNS and site acquisition aspect. This is the most complex part of it.

(Could a moderator please move this to the Alternative Search Engine forum? )

Regards...jmcc

BaseVinyl

3:33 am on Aug 12, 2006 (gmt 0)

10+ Year Member



A server cannot crawl...please don't ask us to do your high school homework!

jmccormac

4:05 am on Aug 12, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



> A server cannot crawl

Obviously not a Microsoft user then. :)

Regards...jmcc

BaseVinyl

4:20 am on Aug 12, 2006 (gmt 0)

10+ Year Member



ouch...jmcc...that was sharp! But true...at least in the geek realm...it still hurt! ;)

jmccormac

4:31 am on Aug 12, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



> ouch...jmcc...that was sharp! But true...at least in the geek realm...it still hurt! ;)

Couldn't resist it. :) I am sitting here watching a few servers (Linux) checking about 70% of the .eu domains registered for websites so I have a good view of spiders in action. Even so these spiders can grind the servers down.

If it is high school homework, it shows that the teacher or lecturer really does not understand the web. (There often seems to be a huge gulf between academia and the real world as regards search.) Most search engines work by crawling pages, extracting links, and following them. So it becomes a layered process. That's why spidering the web in a week is not really feasible: the links have to be built up for a typical search engine.
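
To illustrate the layered process, here is a minimal Python sketch of link-following discovery. It is a toy under stated assumptions, not how any real search engine is implemented: the `LINKS` graph and its URLs are hypothetical, and no network requests are made.

```python
# Hypothetical toy link graph: page -> links found on that page.
LINKS = {
    "a.example/": ["a.example/about", "b.example/"],
    "a.example/about": ["a.example/"],
    "b.example/": ["c.example/", "a.example/about"],
    "c.example/": [],
}

def crawl_in_layers(seeds, links):
    """Return the pages discovered at each crawl depth, starting from seeds."""
    seen = set(seeds)
    layers = []
    frontier = list(seeds)
    while frontier:
        layers.append(frontier)
        next_frontier = []
        for url in frontier:
            # "Fetch" the page and extract its links (a dict lookup here).
            for target in links.get(url, []):
                if target not in seen:
                    seen.add(target)
                    next_frontier.append(target)
        frontier = next_frontier
    return layers

for depth, pages in enumerate(crawl_in_layers(["a.example/"], LINKS)):
    print(depth, pages)
```

Each layer can only be fetched after the previous one has been crawled and parsed, which is why the total time does not shrink simply by adding machines.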

Regards...jmcc

clement

5:08 am on Aug 13, 2006 (gmt 0)

10+ Year Member



Thanks for all the educational replies.
Sorry for my mis-wording of "server"; I meant machines, e.g. the 1RU hardware platforms, as opposed to the HTTP server program.
The reason I am asking is that some news reports mention that a leading search engine has an average of 20K servers/machines in each of nearly 500 datacenter locations worldwide.
I just wonder, out of this huge number of machines,
and given the "layered"/staged processing architecture (crawling -> parsing -> indexing/ranking, etc.),
how many of the machines are needed to do the crawling job
vs. how many machines do other jobs.

I would appreciate someone sharing more knowledge on this aspect.

freeflight2

5:28 am on Aug 13, 2006 (gmt 0)

10+ Year Member



> At one page per second you would require approximately 41336 weeks

Using this number: a dual Xeon can serve/receive 1k+ requests per second if there's a fast enough memory/storage backend => about 41 servers (41,336 pages per second / 1,000 per server) would be the absolute minimum if you already have a list of all URLs, just to get the contents without doing anything with them.

I'd say 1000 of these servers, each connected at 100 Mbit/s, could do the job.

You can lease a 100 Mbit/s server for around $1.2k/mo => 100 Gbit/s combined bandwidth, at a cost of ~$1.2M/mo (1000 × $1.2k).
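
A rough Python sketch of these estimates, using only the figures from this thread (25 billion pages, 1k requests per second per server, 1000 servers at 100 Mbit/s, $1.2k per month each). The variable names are made up, and the per-page byte budget is a derived illustration, not a measured page size.

```python
import math

PAGES = 25 * 10**9
SECONDS_PER_WEEK = 7 * 24 * 60 * 60
REQUESTS_PER_SERVER = 1000   # "1k+ requests per second" per dual Xeon
SERVERS = 1000               # suggested fleet size
LINK_MBIT = 100              # per-server link speed, Mbit/s
LEASE_USD = 1200             # per-server lease, USD per month

required_rate = PAGES / SECONDS_PER_WEEK          # pages/s for a one-week crawl
min_servers = math.ceil(required_rate / REQUESTS_PER_SERVER)
combined_gbit = SERVERS * LINK_MBIT / 1000        # total bandwidth, Gbit/s
monthly_cost = SERVERS * LEASE_USD                # total lease, USD/month

# With 1000 servers, how many bytes can each page average before
# a 100 Mbit/s link becomes the bottleneck?
per_server_rate = required_rate / SERVERS
bytes_per_second = LINK_MBIT * 1_000_000 / 8
page_budget_bytes = bytes_per_second / per_server_rate

print(min_servers, combined_gbit, monthly_cost, round(page_budget_bytes))
```

Rounding up gives 42 servers rather than 41 for the request-rate minimum. With 1000 servers, each one only has to fetch about 41 pages per second, which leaves a budget of roughly 300 KB per page on a 100 Mbit/s link, ignoring protocol overhead, DNS lookups, and politeness delays.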