It has to allow for error, not try to be perfect.
Hey, great topic, I've been thinking about this a lot:
It should use a 'ledger' system to create a secure, 'untamperable' audit trail of everything, similar to Bitcoin. The index and user base would be stored in it, completely transparent and readable. (Personal security is not affected, because users are just GUIDs.)
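To make the ledger idea a bit more concrete, here's a rough sketch of an append-only log where each entry includes the hash of the one before it, so editing any old entry breaks the chain. This is only the tamper-evidence part, not the distributed consensus Bitcoin adds on top. The `Ledger` class, the entry fields, and the `register_user` / `vote` record types are all made up for illustration.

```python
import hashlib
import json
import uuid


def entry_hash(prev_hash, payload):
    """Hash a payload together with the hash of the entry before it."""
    body = json.dumps(payload, sort_keys=True)
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


class Ledger:
    """Append-only list where each entry commits to its predecessor (hypothetical)."""

    GENESIS = "0" * 64  # placeholder "previous hash" for the first entry

    def __init__(self):
        self.entries = []

    def append(self, payload):
        prev = self.entries[-1]["hash"] if self.entries else self.GENESIS
        entry = {"prev": prev, "payload": payload, "hash": entry_hash(prev, payload)}
        self.entries.append(entry)
        return entry

    def verify(self):
        """Recompute every hash; any edited entry makes this return False."""
        prev = self.GENESIS
        for entry in self.entries:
            if entry["prev"] != prev or entry["hash"] != entry_hash(prev, entry["payload"]):
                return False
            prev = entry["hash"]
        return True


ledger = Ledger()
user = str(uuid.uuid4())  # a user is only a GUID, no name attached
ledger.append({"type": "register_user", "user": user})
ledger.append({"type": "vote", "user": user, "url": "https://example.com/"})
print(ledger.verify())
```

Anyone holding a copy can run `verify()` themselves, which is the "transparent and readable" part. Getting thousands of nodes to agree on which copy is the real one is a separate, much harder problem that this sketch does not touch.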
Distributed crawling would have to be done carefully.
The reverse (inverted) index would (unfortunately) need to be stored on every node. (The index wouldn't be as big as you'd think, but this part definitely means the project isn't quite ready for prime time this year... maybe in 5 years.)
Algo-wise, I would prefer mass-volume user ranking above everything else. (Good outnumbers SPAM, so the larger the user base, the more accurate that signal.) I could harp on about why for hours.
Very simple algos; the community has to learn from the mistake Google has made. Vote signals rule above everything, but basic algos help early on, and in general give a starting pattern for new SERPs.
No more random crawling: you have a sitemap and you register your site, or we don't index it. Any site not doing that can probably be ignored these days.
Prefer positive user signals over negative ones (by a considerable margin). Possibly only a positive rating button and a [SPAM warn] button. No negative rating, mainly because people voting 'negative' tend to do it for weirder reasons than people who are 'satisfied' in some way. The 'positive' signal is stronger.
SPAM will exist, but user feedback would eventually make it difficult, more so the larger the user base grows. Unlike what we have now, the number of useful users would far exceed the number of fake users created by black hats (BHs). BH would be almost impossible like this, permanently. They could rig voting, but again, the user base would be an order of magnitude greater than anything a single group or forum could muster.
Users can search completely anonymously, but are unable to vote like that.
User = GUID = does not need to be tied to a name. (Same as Bitcoin)
You can create any number of accounts automatically, a million if you like. We base our 'anti-BH' techniques on the fact that we accept this.
Perhaps some IP restriction on volume account creation. That would be a little 'gray' but may be sensible to avoid bloating the ledger with BH attempts.
A user's voting power increases as time passes, and as the account casts votes that the network accepts. Again, people spamming votes with a million accounts would be up against a wall with that.
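Here's one way the "vote power grows over time" rule could look, just to show why a million fresh accounts wouldn't get far. The formula (log of account age times log of accepted votes, with a cap) is purely my guess at a reasonable shape, and `Account`, `vote_weight`, `page_score` and the cap of 10 are hypothetical names and numbers.

```python
import math
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Account:
    guid: str
    age_days: float      # time since the account first appeared in the ledger
    accepted_votes: int  # votes the network has accepted from this account


def vote_weight(account: Account, max_weight: float = 10.0) -> float:
    """Assumed weighting: grows slowly with age and accepted votes, then caps."""
    age_factor = math.log1p(account.age_days)
    history_factor = math.log1p(account.accepted_votes)
    return min(max_weight, age_factor * history_factor)


def page_score(voters: Iterable[Account]) -> float:
    """Sum of positive vote weights; there is deliberately no negative vote."""
    return sum(vote_weight(a) for a in voters)


# A thousand brand-new bot accounts versus one long-standing account.
fresh = [Account(f"bot-{i}", 0.0, 0) for i in range(1000)]
veteran = Account("old-timer", 365.0, 200)
print(page_score(fresh), page_score([veteran]))
```

With this shape a new account carries zero weight until it has both age and a voting history, so bulk-created accounts have to be kept alive and behaving normally for a long time before they count for anything.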
Open source, obviously. A community of developers and a community of experts make and vote on changes over time. Anybody can fork; anybody with a good idea should be able to prove it works.
Public nodes allow the system to be used as a web service. Anybody is able to run one. If they modify the results, people can decide whether they want to use that 'service' or not.
Non-crawling nodes for those on networks that would have a problem with that.
Crawling in general would be the best place to inject spam, so new sites have to be registered with the network first ("add a new site"). At that point they are sandboxed for a short period, slowing down some BH (that wouldn't work anyway). Then the nodes can begin crawling.
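The register-then-sandbox step could be as simple as the sketch below. The 14-day sandbox is an arbitrary number I picked (the idea above only says "a short period"), and `SiteRegistry` and its methods are hypothetical.

```python
from datetime import datetime, timedelta, timezone

SANDBOX = timedelta(days=14)  # assumed length, not a recommendation


class SiteRegistry:
    """Tracks when each site was registered (hypothetical helper)."""

    def __init__(self):
        self._registered_at = {}

    def register(self, sitemap_url, now):
        # Keep the first registration time so re-registering cannot skip the sandbox.
        self._registered_at.setdefault(sitemap_url, now)

    def is_crawlable(self, sitemap_url, now):
        registered = self._registered_at.get(sitemap_url)
        if registered is None:
            return False  # unregistered sites are never crawled
        return now - registered >= SANDBOX


registry = SiteRegistry()
t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)
registry.register("https://example.com/sitemap.xml", t0)
print(registry.is_crawlable("https://example.com/sitemap.xml", t0 + timedelta(days=1)))
print(registry.is_crawlable("https://example.com/sitemap.xml", t0 + timedelta(days=14)))
```

In the real thing the registration record would live in the ledger, so every node could check the sandbox date independently.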
A method for segmenting crawling tasks would be needed: probably a prior scan of the sitemap, then a series of requests sent to random nodes to perform the crawl. Results would be audited against user accounts, so bad accounts can be detected and locked out if needed.
Nodes cannot request these tasks; each task has to be assigned to a random node associated with a known user (this reduces the chance of a BH node getting a task it wants).
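A rough sketch of the assignment and audit steps, under a couple of assumptions of mine: each chunk of sitemap URLs goes to three randomly picked nodes, and the audit is a simple majority vote on the content digests they report. The replica count, the majority rule, and all function names here are hypothetical.

```python
import random
from collections import Counter


def chunk(urls, size):
    """Split the sitemap's URL list into fixed-size crawl tasks."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def assign_tasks(urls, nodes, chunk_size=2, replicas=3, rng=None):
    """The network picks the nodes at random; a node never asks for a task."""
    rng = rng or random.SystemRandom()
    return [(task, rng.sample(nodes, replicas)) for task in chunk(urls, chunk_size)]


def audit(results):
    """results maps node GUID -> digest that node reported for the same task.

    Returns the majority digest and the nodes that disagreed with it.
    """
    majority, _ = Counter(results.values()).most_common(1)[0]
    suspects = sorted(node for node, digest in results.items() if digest != majority)
    return majority, suspects


nodes = ["node-a", "node-b", "node-c", "node-d", "node-e"]
urls = [f"https://example.com/page{i}" for i in range(5)]
# Seeded only so the example is repeatable; a real network would not use a fixed seed.
assignments = assign_tasks(urls, nodes, rng=random.Random(0))
print(audit({"node-a": "x", "node-b": "x", "node-c": "y"}))
```

Nodes that keep landing in the `suspects` list would be the candidates for lock-out. A majority vote only works while honest nodes outnumber bad ones among the replicas, which is the same bet the voting side makes.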
That's all I can think of for now, totally immature process - just a start.