TypicalSurfer - 8:12 pm on Oct 9, 2012 (gmt 0)
Right, I should have said ccTLD in my post, specifically .uk for this discussion. That doesn't negate the need for a high-quality seed list, though. If you cut a crawler loose on a junky seed list, you'll only crawl junk.
You'd also need some type of domain depth setting. Initially keep it low and only harvest a lowish number of links per document. That way you don't end up in the weeds. Once you get going it's pretty easy to spot "trusted" sites and easier yet to spot spam. Once you get a grip on "trusted" you move those to a different crawl and go deep. Keeping a crawler out of the ditch is a bit of a trick but can be done with some practice.
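Here's a rough sketch of the shallow-first idea, for what it's worth. Everything in it is my own assumption for illustration: the limit values, the names (`SHALLOW`, `DEEP`, `limits_for`, `crawl`), treating "depth" as hops from a seed URL, and passing in a `fetch_links` callable so the sketch runs without touching the network. Untrusted hosts get a low depth and a small link cap per document; hosts you've moved to the trusted set get the deep settings.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical limits, chosen only for illustration.
SHALLOW = {"max_depth": 2, "max_links_per_doc": 5}
DEEP = {"max_depth": 10, "max_links_per_doc": 100}


def limits_for(url, trusted_domains):
    """Return the deep profile for trusted hosts, the shallow one otherwise."""
    host = urlparse(url).hostname or ""
    return DEEP if host in trusted_domains else SHALLOW


def crawl(seeds, fetch_links, trusted_domains=frozenset()):
    """Breadth-first crawl. fetch_links(url) must return a list of outlinks."""
    seeds = list(seeds)
    queue = deque((url, 0) for url in seeds)  # (url, hops from a seed)
    seen = set(seeds)
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        limits = limits_for(url, trusted_domains)
        if depth >= limits["max_depth"]:
            continue  # depth limit reached: don't follow links from this page
        # Harvest only the first few links from each document.
        for link in fetch_links(url)[: limits["max_links_per_doc"]]:
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited
```

You'd run it once with an empty trusted set, eyeball what comes back, then rerun (or run a separate crawl) with the good hosts passed in as `trusted_domains`. How you actually decide what counts as "trusted" is the hard part and isn't covered here.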