dstiles - 8:06 pm on Jun 23, 2013 (gmt 0)
hitchhiker - a good list, but I hope it's not too bitcoin-oriented: they've been compromised a few times recently. :(
Your comment "Distributed crawling would have to be done carefully" is itself worth examining carefully. As a webmaster I could accept distributed crawling IF the crawling IPs were known. That rules out crawlers on dynamic (broadband) IPs, but the crawl could be server-based (e.g. using spare capacity on ordinary web servers), POSSIBLY with a reverse DNS entry; the combination of IP and UA should suffice. One more warning, though: do not crawl from a cloud, where the crawler's IP cannot be readily determined.
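To make the IP + UA check concrete, here's a minimal Python sketch using forward-confirmed reverse DNS, the same verification webmasters already apply to Googlebot and Bingbot. The UA token and hostname suffix are hypothetical placeholders for whatever the SE would publish.

```python
import socket

CRAWLER_UA = "ExampleBot"                      # hypothetical UA token
ALLOWED_SUFFIXES = (".crawl.example-se.com",)  # hypothetical published PTR domain

def verify_crawler(ip: str, user_agent: str) -> bool:
    """Accept a hit as the SE's crawler only if the UA matches AND
    the IP passes forward-confirmed reverse DNS."""
    if CRAWLER_UA not in user_agent:
        return False
    try:
        host, _, _ = socket.gethostbyaddr(ip)        # reverse (PTR) lookup
    except (socket.herror, socket.gaierror):
        return False
    if not host.endswith(ALLOWED_SUFFIXES):
        return False
    try:
        _, _, addrs = socket.gethostbyname_ex(host)  # forward (A) lookup
    except socket.gaierror:
        return False
    return ip in addrs  # forward lookup must confirm the reverse
```

The forward confirmation is the important half: anyone can put a plausible-looking name in their own PTR record, but only the SE can make that name resolve back to its own IP.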
It occurs to me that in its early days the SE's own hosting server may have spare capacity for crawling.
Not sure about a "million accounts" but in any case I'd suggest purging accounts unused for (say) 12 months. On the other hand, allowing the creation of multiple accounts may help in tying down spam sites.
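A purge like that is nearly a one-liner against whatever account store the SE keeps; as a sketch, assuming a hypothetical SQLite table accounts(id, last_active) holding ISO-8601 UTC timestamps:

```python
import sqlite3
from datetime import datetime, timedelta, timezone

conn = sqlite3.connect("accounts.db")  # hypothetical account store
cutoff = (datetime.now(timezone.utc) - timedelta(days=365)).isoformat()
# String comparison works because every timestamp shares one ISO-8601 format.
conn.execute("DELETE FROM accounts WHERE last_active < ?", (cutoff,))
conn.commit()
conn.close()
```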
Going back to the old days of "add a site" is a good idea. I've given up (for now) on adding sitemaps to sites, or updating old ones: G doesn't seem to care, and I've never had a problem with the top-3 engines crawling anyway.
It's a shame that frames are no longer an acceptable part of the web, since a "spam" button could be permanently displayed in the frame. However, opening each clicked-on SERP result in a new tab or window, which would then be closed after use, would return visitors to the index, where a "Report Spam" button could sit against each item. I say "would return..." but this depends to a certain extent on the browser setup.
lexipixel - you mean like DMOZ? A system that eventually became corrupted, and that G tried to claim was authoritative despite most people being unable to submit sites to it. An SE needs crawlers to get content anyway; human editors just could not cope.