cpollett - 6:12 am on Jul 18, 2013 (gmt 0)
In reply to "how feasible would it be to create our own search engine and what things would we need to do it?"
A large scale crawl can actually be done on a relatively small number of machines. For example, IRLBot crawled well over a billion pages using just a single machine in 2009. If you look at how the ClueWeb 2012 dataset was collected, it again used only a small number of machines -- fewer than were used for ClueWeb 2009. You could also forgo crawling altogether and use public datasets like commoncrawl.org or the Internet Archive's datasets. These datasets are large enough that you don't want to just download them over the network. With ClueWeb you can pay for hard drives, but you also need to sign a use agreement.
There are of course open source crawlers/indexers out there that work at web scale, such as Nutch/Lucene/Solr, or my own search engine, Yioop.
The hard part is not getting the pages with a small number of machines -- the hard part is being able to process the pages to any significant degree. This is slowly changing: the algorithms to fake what the big boys are doing are getting better, and of course all hardware is getting faster. As an example of something that is helped by having a lot of machines, consider a conjunctive query. A typical web index has list structures of the form (word, list of documents the word appeared in). Assume these lists are sorted by some global importance measure like PageRank. For a web crawl, the list associated with a word can be on the order of millions to billions of entries. On a single word query you can just return roughly the first x members of the list and be done with it, so the lookup time is proportional to the number of results you want.
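To make the single word case concrete, here is a toy sketch in Python. The index contents and names are made up for illustration (this is not how Yioop or any particular engine stores its data); the point is just that with posting lists pre-sorted best-first, answering a one word query costs time proportional to the number of results you ask for.

```python
# Toy inverted index: word -> list of doc ids, already sorted
# best-first by some global importance score (e.g. PageRank).
INDEX = {
    "earthquake": [2, 7, 9, 15, 31, 40],
    "soccer":     [3, 7, 11, 15, 28],
}

def single_word_query(word, k):
    """Return the top k documents for one word.

    Because the posting list is already sorted by importance,
    this is a slice: time proportional to k, not to the
    (possibly enormous) length of the list.
    """
    return INDEX.get(word, [])[:k]

print(single_word_query("soccer", 3))  # -> [3, 7, 11]
```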
For a two word query where each word is relatively common but the two words don't often occur in the same document (for example, earthquake soccer), the time to find matching documents can be proportional to the length of the shorter list, which might still be millions or billions of entries. This makes such queries very slow on a small number of machines. Splitting the index (the lists for each word) across many machines means each machine has a shorter list to search for intersections.
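A toy sketch of the two word case, again with made-up data and names. To keep the example short, assume each posting list is also mirrored as a hash set so membership tests are constant time; the intersection then walks the shorter list, which is exactly the cost described above -- still millions of entries when both words are common.

```python
# Toy index: word -> posting list sorted best-first by importance,
# plus a mirrored hash set per word for O(1) membership tests.
# (Real engines use sorted doc ids with skip pointers instead;
# this is just to illustrate the cost model.)
INDEX = {
    "earthquake": [2, 7, 9, 15, 31, 40],
    "soccer":     [3, 7, 11, 15, 28],
}
SETS = {word: set(docs) for word, docs in INDEX.items()}

def conjunctive_query(word1, word2, k):
    """Top k documents containing BOTH words.

    Walks the shorter posting list and tests membership in the
    longer word's set, so the work is proportional to the length
    of the shorter list even if few documents match.
    """
    a, b = INDEX.get(word1, []), INDEX.get(word2, [])
    if len(b) < len(a):
        word1, word2 = word2, word1
        a, b = b, a
    longer_set = SETS.get(word2, set())
    out = []
    for doc in a:              # walk the shorter list
        if doc in longer_set:  # O(1) check against the longer one
            out.append(doc)
            if len(out) == k:
                break
    return out

print(conjunctive_query("earthquake", "soccer", 10))  # -> [7, 15]
```

Even with the early break, a query whose words share almost no documents still scans most of the shorter list, which is why sharding those lists across machines helps so much.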
Using a distributed search engine might solve this problem, but it has its own headaches.
With a small number of machines, you were largely out of luck until recently. Now you can use a hybrid inverted index/suffix tree, where the suffix trees are built as in:
Manish Patil, Sharma V. Thankachan, Rahul Shah, Wing-Kai Hon, Jeffrey Scott Vitter, and Sabrina Chandrasekaran. Inverted indexes for phrases and strings. SIGIR 2011, pp. 555-564.
to get something like a conjunctive query, with a fallback to exact string match, in time proportional to the number of results rather than the list sizes.
My own feeling is that, from a technical perspective, it will become increasingly possible for individuals to maintain their own web scale crawls in the future.