Forum Moderators: open
What resources do you think I will need? Here's a cigarette-paper calculation for my estimate:
If there are 8 billion pages with an average size of 30 KB (without graphics), that is 240 terabytes.
So I would need 240 terabytes / 100 GB = 2,400 hard disks.
If I crawl on a 10 Mb/s line, that gives me 1,250,000 bytes per second. To crawl 240 terabytes at that rate would take roughly 2,200 days (about 6 years), so I guess you need a 100 Mb/s connection, bringing it down to around 7 months to crawl the web.
How many computers would I need?
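For anyone who wants to redo the arithmetic above, here is a quick sanity check in Python. All inputs are the poster's assumptions (8 billion pages, 30 KB per page, 100 GB disks, 10 or 100 Mb/s lines), not measured values.

```python
# Back-of-the-envelope check of the numbers in the post above.
# All figures are assumptions from the post, not measurements.

pages = 8_000_000_000        # assumed number of pages on the web
page_bytes = 30_000          # ~30 KB per page, text only
disk_bytes = 100e9           # one 100 GB hard disk

total_bytes = pages * page_bytes
print(f"corpus size : {total_bytes / 1e12:.0f} TB")       # -> 240 TB
print(f"disks needed: {total_bytes / disk_bytes:.0f}")    # -> 2400

for mbit in (10, 100):                                    # line speed in Mb/s
    seconds = total_bytes / (mbit * 1e6 / 8)              # bytes / (bytes per second)
    print(f"{mbit} Mb/s line: {seconds / 86400:.0f} days "
          f"({seconds / 86400 / 365:.1f} years)")
# -> 10 Mb/s: ~2222 days (6.1 years); 100 Mb/s: ~222 days (0.6 years)
```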
Everything you would need for this project can be found here [onzin.nl].
Not only will you need storage, but you'll also need big, beefy machines on fast connections to run your spider software, index the pages (into a DB?), and follow each and every link...
Good luck!
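To put the "spider software, index the pages, follow each and every link" part in concrete terms, here is a toy, single-machine sketch in Python using only the standard library. The database file name, seed URL, and page limit are placeholders; a real crawler at this scale would also need robots.txt handling, politeness delays, URL deduplication beyond an in-memory set, and many machines fetching in parallel.

```python
import sqlite3
import urllib.request
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkParser(HTMLParser):
    """Collects href attributes from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_url, max_pages=100):
    """Toy breadth-first crawl: fetch a page, store it in SQLite, follow its links."""
    db = sqlite3.connect("crawl.db")          # placeholder database file
    db.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, body TEXT)")
    queue, seen = deque([seed_url]), {seed_url}
    while queue and max_pages > 0:
        url = queue.popleft()
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                body = resp.read().decode("utf-8", errors="replace")
        except Exception:
            continue                           # skip unreachable or non-text pages
        db.execute("INSERT OR REPLACE INTO pages VALUES (?, ?)", (url, body))
        db.commit()
        max_pages -= 1
        parser = LinkParser()
        parser.feed(body)
        for link in parser.links:
            absolute = urljoin(url, link)      # resolve relative links
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    db.close()

if __name__ == "__main__":
    crawl("http://example.com/")               # placeholder seed URL
```

A breadth-first queue keeps the sketch simple; at web scale the frontier and the "seen" set themselves become distributed storage problems, which is part of why the hardware estimates above balloon so quickly.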
Sorry for having fun with your thread, Sly - in reality, it's far beyond most of us. You really need a room full of computers connected in an array, and sometimes spread across different locations. My sys admin gave me a 10-minute session on this one day - that's how Google accesses their data so quickly.