brotherhood_of_LAN - 6:26 am on Jul 18, 2013 (gmt 0)
Nice post, cpollett, and I hope you're right that hardware will let a wider range of people try to build competing engines that produce quality results in good time.
It sounds like you've put a lot of work into the problem of the sheer size of the data and which structures can be used to handle it.
What are your plans for your search engine?
RE: seed data, blekko has an Amazon public dataset available, listing unique domains and unique URLs alongside some ranking information. The domain list totals 170 million, so the URL list must be pretty huge. Worth knowing if you're looking for a seed list to start with. I remember these kinds of threads mentioning DMOZ as a seed list, which is tiny in comparison.