brotherhood_of_LAN - 12:20 pm on Oct 9, 2012 (gmt 0)
Indeed, .uk domains, UK IP addresses & even looking at keywords in domains seem to be the more trivial aspects of it, but what kind of precision would they give? A pure guess is they'd catch about 50% of the relevant sites that need to be spidered. It totally depends on how many UK sites are out there that are neither hosted in the UK nor on a .uk domain.
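Those cheap signals could be sketched as a single hostname check. This is a minimal sketch, not a definitive classifier: the function name `likely_uk` and the keyword list are my own assumptions, and IP geolocation is left out since it needs an external database (e.g. MaxMind).

```python
from urllib.parse import urlparse

# Hypothetical keyword list -- tune to taste; crude substring matching
# will produce false positives (e.g. "fukuoka" contains "uk" if you
# matched anywhere, so we only match whole label parts below).
UK_KEYWORDS = {"uk", "britain", "british", "london"}

def likely_uk(url):
    """Cheap pre-filter: does the hostname alone suggest a UK site?"""
    host = (urlparse(url).hostname or "").lower()
    if host.endswith(".uk"):  # covers .co.uk, .org.uk, .ac.uk, ...
        return True
    # split labels on dots and hyphens so "visit-britain.example.com" hits
    parts = host.replace("-", ".").split(".")
    return any(p in UK_KEYWORDS for p in parts)
```

Anything this pre-filter misses would then fall through to the more expensive crawl-and-decide step.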
If the % isn't high enough then more crawling has to happen, and discarding a site won't be an option until it's been spidered.
No doubt some techniques could be used at this point too, though: maybe after X pages have been spidered, decide whether the site is UK-related or not. I'd want to be careful here though, as some portals may have UK-specific content buried a few levels deep.
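The "decide after X pages" idea could look something like the sketch below. Everything here is an assumption for illustration: the `keep_site` name, the signal regex (pound sign, "United Kingdom", UK-postcode shape), and the threshold value, which you'd want to keep low precisely because of the buried-content caveat above.

```python
import re

# Hypothetical UK signals: pound sterling, the country name, or
# anything shaped like a UK postcode (e.g. "SW1A 1AA").
UK_SIGNALS = re.compile(
    r"£|\bUnited Kingdom\b|\b[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}\b"
)

def keep_site(pages, threshold=0.2):
    """pages: texts of the first X pages spidered from one site.
    Keep the site if enough of them show at least one UK signal.
    A low threshold hedges against portals whose UK content is
    buried a few levels deep."""
    if not pages:
        return False
    hits = sum(1 for text in pages if UK_SIGNALS.search(text))
    return hits / len(pages) >= threshold
```

With threshold=0.2, one UK-flavoured page in the first five is enough to keep spidering the site.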
A lot depends on what you would define as a 'UK site', e.g. a Malaysian website promoting package holidays to the UK: in or out?
btw, an idea for a seed list: use something like Majestic and enter a bunch of UK-specific search terms.