What resources to index the whole web?

         

SlyOldDog

9:01 am on Apr 5, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I was thinking about indexing the whole web (not graphics, just text).

What resources do you think I will need? A cigarette paper calculation for my estimate:

If there are 8 billion pages with an average size of 30 KB (without graphics), that is 240 terabytes.

So I would need 240 terabytes / 100 GB = 2400 hard disks.

If I crawl on a 10 Mb/s line, that gives me 1,250,000 bytes per second. Crawling 240 terabytes at that rate would take about 2,200 days (6 years), so I guess I would need a 100 Mb/s connection, which brings the crawl down to roughly 7 months.
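The arithmetic in this estimate can be checked with a few lines of Python. This is only a rough sketch: it assumes decimal units (1 KB = 1,000 bytes), a line that stays fully saturated, and no protocol overhead. The constant and function names are made up for illustration.

```python
# Back-of-the-envelope crawl estimate, using decimal units (1 KB = 1,000 bytes).
PAGES = 8_000_000_000          # page count taken from the post
AVG_PAGE_BYTES = 30_000        # 30 KB of text per page
DISK_BYTES = 100 * 10**9       # one 100 GB disk
SECONDS_PER_DAY = 86_400


def crawl_days(total_bytes: int, line_mbit_per_s: float) -> float:
    """Days needed to download total_bytes over a fully saturated line."""
    bytes_per_second = line_mbit_per_s * 1_000_000 / 8
    return total_bytes / bytes_per_second / SECONDS_PER_DAY


total_bytes = PAGES * AVG_PAGE_BYTES
disks = total_bytes // DISK_BYTES

print(f"total size: {total_bytes / 10**12:.0f} TB")
print(f"100 GB disks: {disks}")
for mbit in (10, 100):
    print(f"{mbit} Mb/s line: {crawl_days(total_bytes, mbit):.0f} days")
```

The figures cover raw page storage and one pass over the web only. Recrawling changed pages and storing an index would add to both numbers.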

How many computers would I need?

grandpa

3:11 pm on Apr 5, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



It's rumored that a 1 terabyte drive is close at hand. So you can cut that hardware significantly: just 240 hard disks.

Chndru

4:32 pm on Apr 5, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I think a slightly more difficult question is: how will you know when you have crawled all there is to crawl? Considering the dynamics of the web, I think that poses a bigger challenge.

lawman

4:47 pm on Apr 5, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



You'll know when you reach THIS [mythologic.net] page. :)

cmatcme

4:51 pm on Apr 5, 2005 (gmt 0)

10+ Year Member



Where's the beginning of the internet then?

cmatcme

[edited by: me at now (my time zone) on today ]
[edit reason: corrected grammar ]

lawman

5:03 pm on Apr 5, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I haven't found it yet.

rocknbil

6:21 pm on Apr 5, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I do believe this was the beginning [slac.stanford.edu].

For this project all you would need can be found here [onzin.nl].

cmatcme

10:17 am on Apr 6, 2005 (gmt 0)

10+ Year Member



Found the download all the internet page but it closed the window.

cmatcme

SlyOldDog

11:00 am on Apr 6, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks for the very interesting comments on the history of the net, guys :)

Any more ideas on the hardware?

BlobFisk

11:06 am on Apr 6, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You're taking on an ambitious and very large project there! Depending on what you want to do with the data, you should look at what Google uses!

Not only will you need storage, but you'll need big beefy machines on quick connections to run your spider software, index the pages (to a DB?) and follow each and every link...
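To make the "follow each and every link" part concrete, here is a minimal sketch of a breadth-first crawl frontier. It is not how any real search engine does it. `FAKE_WEB`, `crawl`, and the `example` URLs are made up for illustration, and the in-memory dictionary stands in for actual HTTP fetches.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> list of URLs it links to.
# A real crawler would fetch each page over HTTP instead.
FAKE_WEB = {
    "http://a.example/": ["http://a.example/about", "http://b.example/"],
    "http://a.example/about": ["http://a.example/"],
    "http://b.example/": ["http://a.example/about", "http://c.example/"],
    "http://c.example/": [],
}


def crawl(seed: str, fetch_links) -> list[str]:
    """Breadth-first crawl: visit each reachable URL once, in discovery order."""
    seen = {seed}              # URLs already queued, so nothing is fetched twice
    frontier = deque([seed])   # URLs waiting to be fetched
    order = []
    while frontier:
        url = frontier.popleft()
        order.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order


visited = crawl("http://a.example/", lambda url: FAKE_WEB.get(url, []))
print(visited)
```

On a fixed graph like this the crawl ends when the frontier is empty. On the live web new pages keep appearing, so in practice the frontier never really drains and the `seen` set itself becomes a storage problem.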

Good luck!

rocknbil

5:50 pm on Apr 6, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



cmatcme - that's by design. :-) It does the "drive a" thing and closes when you click OK to further frustrate the visitor.

Sorry for having fun with your thread, Sly. In reality, it's far beyond most of us. You really need a room full of computers connected in an array, sometimes spread across different locations. My sys admin gave me a 10-minute session on this one day; that's how Google accesses its data so quickly.

Milamber

6:33 pm on Apr 6, 2005 (gmt 0)

10+ Year Member



So I would need 240 terabytes / 100 GB = 2400 hard disks.

Just wait 7 years or so, then you should only need 1 hard drive.
[news.bbc.co.uk...]

gamiziuk

10:14 pm on Apr 6, 2005 (gmt 0)

10+ Year Member



Gee, you will need BOTH Cable and DSL to get working on this project...
;)

httpwebwitch

1:27 am on Apr 8, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



if you're only downloading text you could save a lot of disk space by removing all the punctuation from the internet as you download it i recommend you should get a few dozen really big raid arrays not pc hard drives you could try kazaa too

Leosghost

10:29 am on Apr 8, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Would you want it with or without the adsense on the pages?

SlyOldDog

1:33 pm on Apr 8, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Heh, I just need to store the HTML links so the spider can move on to the next pages, plus the contact pages themselves. Maybe it's easier than I thought.
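Keeping only the links and the contact pages could look roughly like the sketch below, which uses Python's standard `html.parser` and `urllib.parse` modules. The `LinkExtractor` class, the helper functions, and the sample URLs are all hypothetical. The rule that a contact page has "contact" in its URL is a guess for illustration, not something stated in this thread.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from <a href="..."> tags and ignore everything else."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(base_url: str, html: str) -> list[str]:
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


def looks_like_contact_page(url: str) -> bool:
    # Assumption: a contact page has "contact" somewhere in its URL.
    return "contact" in url.lower()


sample = '<p><a href="/contact.html">Contact</a> <a href="http://other.example/">Other</a></p>'
links = extract_links("http://site.example/index.html", sample)
contacts = [u for u in links if looks_like_contact_page(u)]
print(links)
print(contacts)
```

Storing just the extracted URLs takes far less disk than storing whole pages, but every page still has to be downloaded once to find its links, so the bandwidth side of the estimate does not shrink.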