The limiting factor is not servers, it's bandwidth. You'd need a very fat pipe to GET all those pages, and a very fat wallet to pay for it... about $1.5 Million U.S. [webmasterworld.com] -- Oh, and a big disk farm, too.
Jim
Regards...jmcc
I would suggest you ask these questions in a more appropriate forum, as you're more likely to get prompt and knowledgeable replies. The Alternative Search Engines forum would be the best place, as some Tier 2 and Tier 3 search engine operators hang out there (sometimes) and engage in long threaded conversations like this: [webmasterworld.com...]
Regards...jmcc
thanks, let's assume you can use the most scalable, distributed and efficient architecture; it seems you'd need lots of machines if $ is not an issue :-)
A near infinite amount of money or just a googol? :)
so how many (minimum) of these commodity servers (as in my first msg) are needed to finish crawling?
Right, so basing it on some simple social science type numbers (value free numbers):
At one page per second you would require approximately 41336 weeks.
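As a rough check on that figure, here is a minimal sketch of the arithmetic. The post does not state the page count it was working from; 25 billion pages is an assumption, inferred only because it reproduces the 41336 weeks quoted above. The function and constant names are made up for illustration.

```python
SECONDS_PER_WEEK = 7 * 24 * 60 * 60  # 604800 seconds

# Assumption: total page count is not given in the thread; 25 billion is
# inferred from the 41336-week figure at one page per second.
ASSUMED_PAGES = 25_000_000_000


def weeks_to_crawl(total_pages, pages_per_second):
    """Weeks needed to fetch total_pages at a constant fetch rate."""
    return total_pages / (pages_per_second * SECONDS_PER_WEEK)


def rate_for_deadline(total_pages, weeks):
    """Constant pages-per-second rate needed to finish within `weeks`."""
    return total_pages / (weeks * SECONDS_PER_WEEK)


weeks_at_one_per_second = weeks_to_crawl(ASSUMED_PAGES, 1)
rate_for_one_week = rate_for_deadline(ASSUMED_PAGES, 1)

print(round(weeks_at_one_per_second))  # weeks at 1 page/sec
print(round(rate_for_one_week))        # pages/sec needed to finish in 1 week
```

Under that assumption, finishing in a single week would need the same number of pages per second as the number of weeks a one-page-per-second crawler would take, since both are just the page count divided by the seconds in a week.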
The problem is that it is not a straight linear equation because:
a: most of the web is discovered by crawling.
b: a lot of the web is dynamic.
c: you have to build the index organically over a period greater than a week.
d: in addition to the search aspect, you have to have a processing aspect and a DNS and site acquisition aspect. This is the most complex part of it.
(Could a moderator please move this to the Alternative Search Engine forum?)
Regards...jmcc
ouch...jmcc...that was sharp! But true...at least in the geek realm...it still hurt! ;)
Couldn't resist it. :) I am sitting here watching a few servers (Linux) checking about 70% of the .eu domains registered for websites, so I have a good view of spiders in action. Even so, these spiders can grind the servers down.
If this is high school homework, it shows that the teacher or lecturer really does not understand the web. (There often seems to be a huge gulf between academia and the real world as regards search.) Most search engines work by crawling pages, extracting links and following them, so it becomes a layered process. That's why spidering the web in a week is not really feasible: the links have to be built up for a typical search engine.
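To illustrate the layered process described above, here is a minimal sketch of a breadth-first crawl over a toy in-memory link graph. It is not how any particular search engine does it; the graph, page names and the `crawl_in_layers` function are all hypothetical, and a real crawler would fetch pages over HTTP, respect robots.txt and rate limits, and handle dynamic pages.

```python
def crawl_in_layers(link_graph, seeds):
    """Breadth-first traversal that returns pages grouped by discovery layer.

    link_graph maps a page name to the list of pages it links to
    (a stand-in for fetching a page and extracting its links).
    """
    seen = set(seeds)
    frontier = list(seeds)
    layers = []
    while frontier:
        layers.append(frontier)
        next_frontier = []
        for page in frontier:
            # Pages only become known once a page linking to them is crawled.
            for link in link_graph.get(page, []):
                if link not in seen:
                    seen.add(link)
                    next_frontier.append(link)
        frontier = next_frontier
    return layers


# Hypothetical toy web: each key links to the pages in its list.
TOY_WEB = {
    "home": ["about", "news"],
    "about": ["team"],
    "news": ["story1", "story2", "home"],
    "story1": ["archive"],
}

layers = crawl_in_layers(TOY_WEB, ["home"])
for depth, pages in enumerate(layers):
    print(depth, pages)
```

Each layer can only start once the previous one has been fetched and parsed, which is one reason the total time is not simply page count divided by fetch rate.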
Regards...jmcc
appreciate someone sharing more knowledge on this aspect.
At one page per second you would require approximately 41336 weeks
I'd say 1000 of these servers, each connected at 100 Mbit/s, could do the job.
You can lease a 100 Mbit/s server for around $1.2k/mo, which gives 100 Gbit/s combined bandwidth at about $1.2M/mo.
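A quick sanity check on those numbers, as a sketch only. The server count, link speed and lease price come from the post; the 25 billion page total is an assumption (it is the count that matches the 41336-week figure quoted above), and the function names are made up. The sketch ignores protocol overhead, DNS lookups, politeness delays and storage, so it is an upper bound on what the bandwidth alone allows.

```python
SECONDS_PER_WEEK = 7 * 24 * 60 * 60  # 604800 seconds

# Assumption: not stated in the thread, inferred from the 41336-week figure.
ASSUMED_PAGES = 25_000_000_000


def combined_bits_per_second(servers, mbit_per_server):
    """Aggregate bandwidth if every server saturates its link."""
    return servers * mbit_per_server * 1_000_000


def bytes_per_page_budget(total_pages, bits_per_second, seconds):
    """Average bytes per page the pipe can carry within the time window."""
    total_bytes = bits_per_second / 8 * seconds
    return total_bytes / total_pages


def monthly_cost(servers, dollars_per_server):
    """Lease cost per month, servers only."""
    return servers * dollars_per_server


bps = combined_bits_per_second(1000, 100)
budget = bytes_per_page_budget(ASSUMED_PAGES, bps, SECONDS_PER_WEEK)
cost = monthly_cost(1000, 1200)

print(bps)            # aggregate bits per second
print(round(budget))  # average bytes available per page in one week
print(cost)           # dollars per month for the leases
```

Under these assumptions the pipe allows roughly 300 KB per page over a week, so whether 1000 servers "could do the job" depends mostly on average page size and on how much of each link is lost to overhead.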