hitchhiker - a good list but I hope not too bitcoin-oriented: they got compromised a few times recently. :(
Your comment "Distributed crawling would have to be done carefully" deserves careful examination. As a webmaster I could accept distributed crawling IF the crawling IPs were known. This rules out dynamic (broadband) based crawlers, but server-based crawling (eg using spare capacity of ordinary web servers) could work, POSSIBLY with a reverse DNS entry; the combination of IP and UA should suffice. Another warning, though: do not crawl from a cloud - the IP cannot be readily determined.
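The IP-plus-reverse-DNS check described above can be made concrete with forward-confirmed reverse DNS (the same scheme the big crawlers use for bot verification). A minimal sketch, assuming a hypothetical crawler domain `crawl.example-se.org`; the lookup functions are injectable so the logic can be shown with stubs instead of live DNS:

```python
import socket

def verify_crawler(ip, allowed_suffixes,
                   reverse_lookup=lambda ip: socket.gethostbyaddr(ip)[0],
                   forward_lookup=lambda host: socket.gethostbyname(host)):
    """Forward-confirmed reverse DNS: the IP must resolve to a hostname
    under one of the crawler operator's domains, and that hostname must
    resolve back to the same IP."""
    try:
        host = reverse_lookup(ip)
    except OSError:
        return False
    if not any(host.endswith(suffix) for suffix in allowed_suffixes):
        return False
    try:
        return forward_lookup(host) == ip
    except OSError:
        return False

# Example with stubbed lookups (no network needed); domain is hypothetical.
ok = verify_crawler(
    "203.0.113.7", (".crawl.example-se.org",),
    reverse_lookup=lambda ip: "node1.crawl.example-se.org",
    forward_lookup=lambda host: "203.0.113.7",
)
print(ok)  # True for this stubbed pair
```

A webmaster could run the same check from the server logs before whitelisting the bot.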
It occurs to me that in the early days the SE hosting server may have spare capacity for crawling.
Not sure about a "million accounts" but in any case I'd suggest purging accounts unused for (say) 12 months. On the other hand, allowing the creation of multiple accounts may help in tying down spam sites.
Going back to the old days of "add a site" is a good idea. I've given up (for now) adding (or updating old) sitemaps to sites: G doesn't seem to care and I've never had a problem with the top-3 crawling anyway.
It's a shame that frames are no longer an acceptable part of the web, since a "spam" button could be permanently displayed in the frame. However, opening each clicked-on SERP result in a new tab or window (which would then be closed after use) would return visitors to the index, where a "Report Spam" button could sit against each item. I say "would return..." but this depends to a certain extent on the browser setup.
lexipixel - you mean like DMOZ? An eventually corrupted system that G tried to claim was authoritative despite most people being unable to submit sites to it. An SE needs crawlers to get content anyway. Human editors just could not cope.
|lexipixel - you mean like DMOZ? An eventually corrupted system that G tried to claim was authoritative |
Sort of, but not exactly. I was a DMOZ editor for several years, so some of my thoughts are based on the ODP model.
What I have in mind is more a network of directory sites with a common data format and some interchange mechanisms.
It is likely to be heavily spammed.
|Going back to the old days of "add a site" is a good idea. |
This is a major problem with building such a search engine - relying on the user to detect spam. It is not a good way of handling things, because some of the content that can end up in a blind-crawl index may be illegal.
|a "Report Spam" button against each item |
Something like ODP's RDF format? Again the issue of who makes money from all this arises.
|What I have in mind is more a network of directory sites with a common data format and some interchange mechanisms. |
What most people do not see with the web is that it changes from day to day. Thousands of domains drop and thousands more are registered. One of the biggest issues with DMOZ was that it had no ability to self-clean its index. Domains that were in DMOZ were often dropped and reregistered. They remained there despite often no longer having the original content or owner. This kind of thing also affects SEs.
I think "add a site" would be less spammed than auto-find by crawling.
I would only expect a small percentage of users to hit the "spam" button but that's still better than none.
One anti-spam method for both "add a site" and crawl may be to pay attention to the various reports of evil DNS and domain name registrars. Both can be discovered to at least a reasonable degree (which is why I do not understand the domain registration agencies' failure to pick up on this).
A vast number of the domains registered and dropped each day are registered by criminals and a fair number are used for virus-serving sites for a few days before being rumbled; and some of the more robust ones find their way into search engines. Something else that domain registrars could detect and deflect.
Anti-spam/virus measures on an SE would need someone permanently assigned to the problem; that could be a downer on a start-up.
Lots of great ideas coming through. Please keep them coming.
OK, so my first problem with a new open source search engine is that it has no data, so when people come to it they are bound to be disappointed. So I used a Bing script to bring in some results so that we wouldn't have an empty search result page. These results come in below the results from our own database.
Next I needed a crawler, open source. This will need to be replaced with a P2P crawler, but to start with I grabbed a download of Sphider and did a quick crawl of W3C.
Next I needed a way to get people to add their sites, and thought of the suggestions about human review and DMOZ. In the end I thought the easiest way would be for people to bookmark them. The sites could then be voted up or down, and those that gained positive votes would then be crawled. So I downloaded SemanticScuttle for this.
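The vote-gated crawl queue described here can be sketched in a few lines. This is a minimal illustration, not SemanticScuttle's actual mechanism; the threshold of 3 net votes is an arbitrary assumption:

```python
from collections import defaultdict

CRAWL_THRESHOLD = 3  # assumption: net positive votes needed before crawling

votes = defaultdict(int)   # url -> net score (up = +1, down = -1)
crawl_queue = []           # urls cleared for the crawler
queued = set()

def vote(url, up=True):
    votes[url] += 1 if up else -1
    # Promote a site to the crawl queue once it clears the threshold.
    if votes[url] >= CRAWL_THRESHOLD and url not in queued:
        queued.add(url)
        crawl_queue.append(url)

for _ in range(3):
    vote("https://www.w3.org/")
vote("https://spam.example/", up=False)

print(crawl_queue)  # ['https://www.w3.org/']
```

In practice the votes table would live in the bookmarking tool's database, with the crawler polling for newly promoted URLs.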
Lots more to come. I have added the URL of the site to my profile. I can't and don't want to do this alone, so please get involved. Anyone out there that can design a logo?
The meatbots would be all over it. It is a major issue with web directories, and this would be orders of magnitude larger.
|I think "add a site" would be less spammed than auto-find by crawling. |
Possibly but then you run into the negative SEO issue of people trying to knock out their competitors.
|I would only expect a small percentage of users to hit the "spam" button but that's still better than none. |
Again it is a moving target. The registrars are not in the business of managing the entitlement of each registrant to a domain name - with gTLDs, they are simply in the bulk domain registration business.
|One anti-spam method for both "add a site" and crawl may be to pay attention to the various reports of evil DNS and domain name registrars. Both can be discovered to at least a reasonable degree (which is why I do not understand the domain registration agencies failure to pick up on this). |
I've seen the same claims made before, but they are wrong. The vast majority of domains registered each day are registered by ordinary people and businesses. The five day window in which a domain can be dropped without payment was abused, but for domain tasting. If a registrar's five day deletes go above a certain percentage each month, then they have to pay a percentage of the registration fee. This has significantly reduced the problem. There is also a development curve from the registration date of a domain to a fully functional website appearing (if ever) on that domain name.
|A vast number of the domains registered and dropped each day are registered by criminals and a fair number are used for virus-serving sites for a few days before being rumbled; and some of the more robust ones find their way into search engines. Something else that domain registrars could detect and deflect. |
If the SE uses the moronic GIGO approach used by Google and the other major players, it would need more than a single person. It is a continually changing threat environment and a link that might have been good yesterday could be hacked today and carrying a malware payload.
|Anti-spam/virus measures on an SE would need someone permanently assigned to the problem; that could be a downer on a start-up. |
Auto-find, otherwise known as blind crawling, is a very inefficient way of finding new websites. It is also junk-prone. However, the real issue is that due to the FUD from Google and its cargo-cult SEOs, the link structure of the web is decaying. Sites no longer heavily link to each other, and this means that it is far harder to find sites. Reciprocal links, especially at the index-page level, are becoming rarer.
> The meatbots would be all over it
There are ways of detecting (most?) auto-submission agents, same as detecting auto-scrapers.
> negative SEO issue
Agreed. Not sure how to avoid that one.
Domain names / DNS - there are indicators in DNS, and certainly some DNS servers are very suspect. I agree it would take a lot of work, but what is the project's aim? To avoid as much spam as possible. Some DNS servers are "obviously" compromised and could be trapped.
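Trapping "obviously" suspect DNS operators could start as a simple suffix match of a domain's nameservers against a locally maintained blocklist. A sketch under assumptions: the blocklist entries are hypothetical, and the NS records would come from an actual DNS query (e.g. dnspython's `dns.resolver.resolve(domain, "NS")`) rather than being passed in as below:

```python
# Hypothetical known-bad DNS operators; in practice sourced from abuse reports.
BAD_NS_SUFFIXES = {"ns-badhost.example", "dns.evil-farm.example"}

def suspect_nameservers(nameservers, blocklist=BAD_NS_SUFFIXES):
    """Return the subset of a domain's nameservers that fall under a
    known-bad DNS operator (exact match or subdomain of a blocklisted name)."""
    hits = set()
    for ns in nameservers:
        ns = ns.rstrip(".").lower()
        if any(ns == bad or ns.endswith("." + bad) for bad in blocklist):
            hits.add(ns)
    return hits

print(suspect_nameservers(["ns1.ns-badhost.example.", "ns2.cleanhost.example."]))
```

A site whose nameservers hit the blocklist could be held for review rather than rejected outright, since shared DNS hosting produces false positives.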
I saw some stats that gave the number of criminal domains registered per day and was very surprised.
seoskunk - make sure to set a unique user-agent string with a URL pointing to the bot's page on the "site", even if the real SE has no existence as yet. For a new bot there should be at least a minimum policy set out - "We do not sell on" etc.
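The conventional UA format is the bot name and version followed by a `(+URL)` pointing at the policy page. A sketch with a hypothetical bot name and URL, using only the standard library (the request is built but not sent):

```python
import urllib.request

# Hypothetical bot name and info URL; the info page should carry the
# crawl policy ("We do not sell on", contact address, etc.).
USER_AGENT = "OpenSEBot/0.1 (+https://example-se.org/bot.html)"

req = urllib.request.Request(
    "https://www.w3.org/",
    headers={"User-Agent": USER_AGENT},
)
# urllib normalises stored header names to capitalised form.
print(req.get_header("User-agent"))
```

Webmasters checking their logs can then follow the URL straight from the UA string, which is exactly what makes the IP+UA whitelisting discussed earlier workable.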
Before adding sites you could check their WOT rank, see [mywot.com...] Only add sites with a good reputation. If you don't want adult content you can also check the child-safety rating.
Instead of manually adding sites you could choose to start with the 1 million top sites of Alexa.
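Alexa distributed its top sites as a plain `rank,domain` CSV with no header row, so seeding the crawler from it is mostly a parsing exercise. A sketch on an inline sample (the real file would be read from disk after download):

```python
import csv
import io

# Alexa's top-1m.csv format: "rank,domain", no header row.
sample = "1,google.com\n2,facebook.com\n3,youtube.com\n"

def seed_urls(csv_text, limit=2):
    """Turn the first `limit` rows of a rank,domain CSV into crawl seeds."""
    seeds = []
    for rank, domain in csv.reader(io.StringIO(csv_text)):
        seeds.append("http://" + domain + "/")
        if len(seeds) >= limit:
            break
    return seeds

print(seed_urls(sample))  # ['http://google.com/', 'http://facebook.com/']
```

Starting from a popularity-ranked list also sidesteps some of the spam problems raised above, since the head of the list is mostly legitimate sites.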
Yes but the meatbots sometimes do manual submissions. The auto-subs are easier to detect.
|There are ways of detecting (most?) auto-submission agents, same as detecting auto-scrapers. |
The negative SEO one is manually intensive as it would require people checking to ensure that it is a legitimate delisting request.
This does get back to the "bad neighbourhood" concept and it is certainly a valid one because problem DNSes and website clusters exist. I do a lot of domain / DNS work because of my main website. At the moment, I'm running a full gTLD website IP survey.
|Domain names / DNS - there are indicators in DNS and certainly some DNS servers are very suspect. I agree it would take a lot of work but what is the project's aim - to avoid as much spam as possible. Some DNS servers are "obviously" compromised and could be trapped. |
Again, you have to be very careful about these numbers and their sources. An example: someone claimed, on the strength of being a network administrator or an avid reader of "technology" journalism, that most domains registered and dropped within the five day window over the last five years or so were registered for spam. This was absolutely wrong, because on some days the complete day's drop was being reregistered by domain tasters. The domains would get websites with PPC advertising, and if they made a predetermined amount in those five days they would be retained; if not, they were dropped again. ICANN was shamed into changing the regulations around the five day window (in which a domain could be dropped without the registrar having to pay the registration fee). The system was corrupted at a registrar level, though what was being done wasn't technically illegal.

The standard drugs/warez/pron sites do exist, but some of these may use existing websites rather than new domains. As I said, you've got to be quite cynical when it comes to "technology" journalism, because a lot of it consists of recycled press releases, and often these press releases come from companies (anti-virus/anti-spam/anti-*) trying to sell something.

The real problem, from a search engine developer's point of view, is the number of compromised websites. Too many compromised websites in an index makes it a toxic index, and then you have the same problem as Google.
|I saw some stats that gave the number of criminal domains registered per day and was very surprised. |
It's outside the scope of this thread, but what kind of IP survey are you doing? I ask because I do a lot of IP/BGP/routing research.
@bhukkel Mapping every gTLD website by IP address and then using that data for other research. It sounds simple when it is written like that. :)