jmccormac - 7:05 pm on Oct 9, 2012 (gmt 0)
jmcc, a formidable task but one that you're familiar with. Why not try a city like London to start off with? The UK is a big country, which makes the task bigger; London would be a good 'proof of concept'.

London might be a good test for such a search engine.
The seedlist is approximately 8 million sites. That number will fall significantly once some web survey classification algorithms are applied (the ones I use for classifying web usage in TLDs). This kind of approach is different from the conventional search engine approach of spidering everything and then trying to clean the index.
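Purely as an illustration (the actual web survey classification algorithms mentioned above aren't public, so these heuristics are assumptions), a pruning pass over a seedlist might classify each homepage and drop parked or holding pages before any full spidering:

```python
import re

# Illustrative markers only; a real classifier would use far richer signals.
PARKED_MARKERS = (
    "this domain is for sale",
    "domain parking",
    "buy this domain",
)

def classify_homepage(html: str) -> str:
    """Classify a homepage as 'parked', 'redirect', or 'active'."""
    text = html.lower()
    if any(marker in text for marker in PARKED_MARKERS):
        return "parked"
    # A meta-refresh to another site is a common holding-page pattern.
    if re.search(r'<meta[^>]+http-equiv=["\']?refresh', text):
        return "redirect"
    return "active"

def prune_seedlist(pages: dict) -> list:
    """Keep only domains whose homepage looks like a real site."""
    return [domain for domain, html in pages.items()
            if classify_homepage(html) == "active"]
```

The point is the ordering: classification happens before indexing, so the junk never enters the index in the first place.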
The first issue I see is knowing where to spider, and the wasted effort of discarding 'non-UK' sites. I guess you would want to start with a seed list of sites and follow the links a couple of levels deep; that should give you a good proportion of related sites.
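A minimal sketch of that seed-list approach: a breadth-first crawl that follows links a couple of levels deep and keeps only .uk hosts. The fetch function is injected rather than hard-coded, so the sketch stands alone without a live network; a real crawler would plug in an HTTP client (and robots.txt handling, politeness delays, etc.).

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkParser(HTMLParser):
    """Collect href values from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def is_uk(url: str) -> bool:
    """Crude nationality test: does the host sit under .uk?"""
    host = urlparse(url).hostname or ""
    return host.endswith(".uk")

def crawl(seeds, fetch, max_depth=2):
    """Breadth-first crawl from a seed list, following links a couple
    of levels deep and keeping only .uk hosts.  fetch(url) returns the
    page HTML."""
    seen = set(seeds)
    queue = deque((url, 0) for url in seeds)
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue
        parser = LinkParser()
        parser.feed(fetch(url))
        for href in parser.links:
            absolute = urljoin(url, href)
            if is_uk(absolute) and absolute not in seen:
                seen.add(absolute)
                queue.append((absolute, depth + 1))
    return seen
```

Restricting by TLD at crawl time is only a first cut: as noted below, plenty of UK sites live under .com or on transnational hosters, so a production filter would need hosting-location and content signals as well.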
The UK webspace is also more complex than some in that it has a few transnational hosters covering a number of countries. This causes an adjacent-markets effect, where businesses in other countries selling into the UK market will have registered a .uk domain name.
The redundancy issue (clone websites) is one with which Google has problems. Once the clone websites get into the index, it becomes very difficult to solve. The web usage algorithms, however, actually solve it before indexing. And as the UK is a mature market in web terms, a high percentage of UK websites will be hosted in the UK.
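To make the "solve it before indexing" idea concrete, here is one simple way it could be done (an assumption for illustration, not the actual web usage algorithms): fingerprint each page's normalised visible text and keep one representative per fingerprint, so clones are filtered out before they ever reach the index.

```python
import hashlib
import re

def fingerprint(html: str) -> str:
    """Fingerprint a page by hashing its normalised visible text, so
    trivially re-skinned clones collapse to the same key."""
    text = re.sub(r"<[^>]+>", " ", html)           # strip tags
    text = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def drop_clones(pages: dict) -> dict:
    """Keep one representative per fingerprint; clones never reach
    the index, rather than being cleaned out of it afterwards."""
    kept, seen = {}, set()
    for domain, html in pages.items():
        fp = fingerprint(html)
        if fp not in seen:
            seen.add(fp)
            kept[domain] = html
    return kept
```

Exact-hash matching only catches byte-identical text after normalisation; near-duplicates would need something like shingling or simhash, but the pipeline position (filter first, index second) is the point being made above.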