jmcc, a formidable task but one that you're familiar with. Why not try a city, like London to start off with? The UK is a big country and makes the task bigger, London would be a good 'proof of concept'.
London might be a good test for such a search engine.
The first issue I see is knowing where to spider and the redundancy of discarding 'non-UK' sites. I guess you would want to start off with a seed-list of sites and follow the links a couple of levels deep and you could have a good proportion of related sites.
The seedlist of sites is approximately 8 million. This will fall significantly with the application of some web survey classification algorithms. (The ones I use for classifying web usage in TLDs). This kind of approach is different to the conventional search engine approach of spidering everything and then trying to clean the index.
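For anyone following along, the "seed list, then follow links a couple of levels deep" idea can be sketched roughly as below. This is only a toy illustration and not how jmcc's system works: the function name `crawl_from_seeds`, the `get_links` callback and the hostnames in `LINKS` are all made up, and a small in-memory link graph stands in for real fetching.

```python
from collections import deque

def crawl_from_seeds(seeds, get_links, max_depth=2):
    """Breadth-first walk from the seed sites, following links at most max_depth levels."""
    seen = set(seeds)
    queue = deque((site, 0) for site in seeds)
    while queue:
        site, depth = queue.popleft()
        if depth == max_depth:
            continue  # deep enough: keep the site but do not expand its links
        for target in get_links(site):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return seen

# Toy link graph standing in for fetched pages (hypothetical hostnames).
LINKS = {
    "a.example.co.uk": ["b.example.co.uk", "c.example.com"],
    "b.example.co.uk": ["d.example.org.uk"],
    "d.example.org.uk": ["e.example.net"],
}

found = crawl_from_seeds(["a.example.co.uk"], lambda s: LINKS.get(s, []), max_depth=2)
print(sorted(found))
```

With `max_depth=2` the walk stops before reaching `e.example.net`, which sits three links away from the seed. In practice the classification step mentioned above would prune the seed list before anything like this runs.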
The directory model has failed so many times it's a little bit tragic though. DMOZ - spammed to hell (they never reacted to market changing around them and the model didn't scale). Yahoo - died a slow death (again, they never reacted with the market changes). Any link directory - was always spam. Yellow Pages, etc - slow to embrace technology, slow to adapt their model. But I do think a directory model of some sorts could work.
This is part of the approach that I was looking at. As it would be deeply linked to the search engine, a site getting knocked out of the index may well have the directory entry knocked out too. Though Bing has been improving over the last year or so (or Google has been declining), the last thing I would want to do is adopt its approach to local search. The vortal approach to niche search is also a possibility. It is one of the things that I've tested out on a small (approximately 2K sites) scale.
Your initial crawl could just be bone-headed in terms of TLD, that would give a good base, a decent document collection.
It looks like the most obvious approach but it is also a very dangerous one when it comes to a country level search engine because spam, out of area sites, holding pages and PPC parked sites make their way into the index. While the UK is ccTLD positive (more local ccTLD domains registered than com/net/org/biz/info etc), there is still a significant number of non .uk websites that would be excluded in a TLD restricted crawl.
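To make the two problems with a TLD-only crawl concrete, here is a rough sketch of what a "bone-headed" .uk filter does. The helper names `is_uk_cctld` and `split_by_tld` and all the URLs are invented for illustration; it only checks the hostname suffix and knows nothing about what is actually on the site.

```python
from urllib.parse import urlsplit

def is_uk_cctld(url):
    """True if the URL's hostname ends in .uk (suffix check only)."""
    host = urlsplit(url).hostname or ""
    return host.lower().rstrip(".").endswith(".uk")

def split_by_tld(urls):
    """Partition URLs into those a .uk-only crawl keeps and those it drops."""
    kept = [u for u in urls if is_uk_cctld(u)]
    dropped = [u for u in urls if not is_uk_cctld(u)]
    return kept, dropped

# Hypothetical URLs: a real UK business on .com, and a parked page on .co.uk.
urls = [
    "http://www.example.co.uk/",
    "http://london-plumber.example.com/",
    "http://parked.example.co.uk/",
]
kept, dropped = split_by_tld(urls)
print(kept)
print(dropped)
```

The filter keeps the parked .co.uk page and drops the London business on .com, which is exactly the wrong way round for both. That is why some classification of the site itself is needed on top of (or instead of) the TLD test.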