lexipixel - you mean like DMOZ? An eventually corrupted system that G tried to claim was authoritative.
-dstiles
Going back to the old days of "add a site" is a good idea.
It is likely to be heavily spammed.
a "Report Spam" button against each item
This is a major problem with building such a search engine - relying on the user to detect spam. It is not a good way of handling things because some of the content that ends up in a blind-crawl index may be illegal.
What I have in mind is more a network of directory sites with a common data format and some interchange mechanisms.
Something like ODP's RDF format? Again, the issue arises of who makes money from all this.
I think "add a site" would be less spammed than auto-find by crawling.
The meatbots would be all over it. It is a major issue with web directories and this would be orders of magnitude larger.
I would only expect a small percentage of users to hit the "spam" button but that's still better than none.
Possibly, but then you run into the negative SEO issue of people trying to knock out their competitors.
One anti-spam method for both "add a site" and crawl may be to pay attention to the various reports of evil DNS and domain name registrars. Both can be discovered to at least a reasonable degree (which is why I do not understand the domain registration agencies' failure to pick up on this).
Again, it is a moving target. The registrars are not in the business of managing the entitlement of each registrant to a domain name - with gTLDs, they are simply in the bulk domain registration business.
A vast number of the domains registered and dropped each day are registered by criminals and a fair number are used for virus-serving sites for a few days before being rumbled; and some of the more robust ones find their way into search engines. Something else that domain registrars could detect and deflect.
I've seen the same claims made before but they are wrong. The vast majority of domains registered each day are registered by ordinary people and businesses. The five-day window in which a domain can be dropped without payment was abused, but for domain tasting. If a registrar's five-day deletes go above a certain percentage each month then they have to pay a percentage of the registration fee. This has significantly reduced the problem. There is also a development curve from the registration date of a domain to a fully functional website appearing (if ever) on that domain name.
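To make the excess-deletes rule concrete, here is a rough Python sketch of how such a charge could be worked out for one registrar in one month. The 10% allowance and the 20-cent charge are placeholder assumptions for illustration only, not the actual policy figures, and the function name is made up.

```python
def excess_delete_charge_cents(new_registrations, grace_deletes,
                               allowed_percent=10, fee_cents=20):
    """Charge (in cents) for five-day deletes above the monthly allowance.

    allowed_percent and fee_cents are ASSUMED placeholder values,
    not the real policy numbers.
    """
    # Deletes up to this many are free of charge.
    allowed = new_registrations * allowed_percent // 100
    # Only the deletes beyond the allowance are charged.
    excess = max(0, grace_deletes - allowed)
    return excess * fee_cents


# An ordinary registrar: few deletes, so nothing to pay.
print(excess_delete_charge_cents(1000, 50))
# A tasting operation: most registrations dropped inside the window.
print(excess_delete_charge_cents(1000, 900))
```

The point of a rule shaped like this is that normal levels of deletes cost nothing, while registering and dropping in bulk becomes expensive.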
Anti-spam/virus measures on an SE would need someone permanently assigned to the problem; that could be a downer on a start-up.
If the SE uses the moronic GIGO approach used by Google and the other major players, it would need more than a single person. It is a continually changing threat environment and a link that might have been good yesterday could be hacked today and carrying a malware payload.
There are ways of detecting (most?) auto-submission agents, same as detecting auto-scrapers.
Yes, but the meatbots sometimes do manual submissions. The auto-subs are easier to detect.
Domain names / DNS - there are indicators in DNS and certainly some DNS servers are very suspect. I agree it would take a lot of work but what is the project's aim - to avoid as much spam as possible. Some DNS servers are "obviously" compromised and could be trapped.
This does get back to the "bad neighbourhood" concept, and it is certainly a valid one because problem DNSes and website clusters exist. I do a lot of domain / DNS work because of my main website. At the moment, I'm running a full gTLD website IP survey.
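As a rough illustration of the "bad neighbourhood" idea, the Python sketch below flags nameservers where a large share of the hosted domains are already known to be bad, then lists the other domains on those nameservers as suspects. It assumes you already have a domain-to-nameserver mapping and a known-bad list from somewhere; the function names, thresholds and sample data are all hypothetical.

```python
from collections import Counter


def suspect_nameservers(domain_nameservers, known_bad,
                        min_ratio=0.5, min_domains=3):
    """Nameservers where at least min_ratio of hosted domains are known bad.

    min_ratio and min_domains are ASSUMED thresholds, not tuned values.
    """
    total = Counter()
    bad = Counter()
    for domain, nameservers in domain_nameservers.items():
        for ns in set(nameservers):
            total[ns] += 1
            if domain in known_bad:
                bad[ns] += 1
    # Skip nameservers seen too rarely to judge.
    return {ns for ns in total
            if total[ns] >= min_domains and bad[ns] / total[ns] >= min_ratio}


def suspect_domains(domain_nameservers, known_bad, **thresholds):
    """Not-yet-flagged domains that share a suspect nameserver."""
    flagged = suspect_nameservers(domain_nameservers, known_bad, **thresholds)
    return {domain for domain, nameservers in domain_nameservers.items()
            if domain not in known_bad and flagged.intersection(nameservers)}


# Made-up sample data using reserved example names.
sample = {
    "a.example": ["ns1.shady.example"],
    "b.example": ["ns1.shady.example"],
    "c.example": ["ns1.shady.example"],
    "d.example": ["ns1.clean.example"],
    "e.example": ["ns1.clean.example"],
    "f.example": ["ns1.clean.example"],
}
print(suspect_domains(sample, known_bad={"a.example", "b.example"}))
```

A result like this is only a reason to look closer at a domain, not proof of anything, since plenty of legitimate sites sit on cheap shared DNS next to bad ones.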
I saw some stats that gave the number of criminal domains registered per day and was very surprised.
Again, you have to be very careful about these numbers and their sources. An example was someone claiming, on the strength of being a network administrator or an avid reader of "technology" journalism, that most domains registered and dropped within the five-day window over the last five years or so were registered for spam. This was absolutely wrong because the complete day's drop on some days was being reregistered by domain tasters. The domains would have websites with PPC advertising; if they made a predetermined amount in those five days they would be retained, and if not they would be dropped again. ICANN was shamed into changing the regulations on the five-day window (in which a domain could be dropped without the registrar having to pay the registration fee). The system was corrupted at a registrar level. However, what was being done wasn't technically illegal. The standard drugs/warez/pron sites do exist but some of these may use existing websites rather than new domains. As I said, you've got to be quite cynical when it comes to "technology" journalism because a lot of it consists of recycled press releases, and often these press releases come from companies (anti-virus/anti-spam/anti-*) trying to sell something. The real problem, from a search engine developer's point of view, is the number of compromised websites. Too many compromised websites in an index make it a toxic index, and then you have the same problem as Google.