Alternative Search Engines Forum

A new Search Engine - What would you want from it?
seoskunk




msg:4586538
 9:27 pm on Jun 21, 2013 (gmt 0)

I am thinking there is a place in the market for a new P2P search engine.

I believe the funding of such an engine could be crowdsourced.

What would users want from a new P2P search engine, and would you like to be involved?

 

dstiles




msg:4586915
 8:06 pm on Jun 23, 2013 (gmt 0)

hitchhiker - a good list, but I hope not too bitcoin-oriented: they got compromised a few times recently. :(

Your comment "Distributed crawling would have to be done carefully" would itself have to be examined carefully. As a webmaster I could accept distributed crawling IF the crawling IPs were known. This excludes dynamic (broadband) based crawlers but could be server-based (e.g. using spare capacity of ordinary web servers), POSSIBLY with a reverse DNS entry, but the combination of IP and UA should suffice. Another warning, though: do not crawl from a cloud - the IP cannot be readily determined.
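As a rough sketch of what that webmaster-side check might look like (the UA token and reverse-DNS zone below are hypothetical placeholders, not anything this project has actually published):

```python
# Verify a distributed crawler the way described above: known UA string plus a
# reverse-DNS lookup of the crawling IP, forward-confirmed so spoofed PTR records fail.
import socket

CRAWLER_UA_TOKEN = "ExampleP2PBot"          # hypothetical UA token
CRAWLER_RDNS_SUFFIX = ".crawl.example.org"  # hypothetical rDNS zone

def is_legit_crawler(ip: str, user_agent: str) -> bool:
    if CRAWLER_UA_TOKEN not in user_agent:
        return False
    try:
        host, _, _ = socket.gethostbyaddr(ip)       # reverse lookup
        if not host.endswith(CRAWLER_RDNS_SUFFIX):
            return False
        forward_ips = socket.gethostbyname_ex(host)[2]  # forward confirmation
        return ip in forward_ips
    except (socket.herror, socket.gaierror):
        return False

# Example:
# is_legit_crawler("203.0.113.10", "Mozilla/5.0 (compatible; ExampleP2PBot/0.1)")
```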

It occurs to me that in the early days the SE hosting server may have spare capacity for crawling.

Not sure about a "million accounts" but in any case I'd suggest purging accounts unused for (say) 12 months. On the other hand, allowing the creation of multiple accounts may help in tying down spam sites.
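A minimal sketch of that purge, assuming a simple accounts table with a last_seen column (both names are made up for illustration):

```python
# Delete accounts that have not been seen for a given number of months.
import sqlite3

def purge_stale_accounts(db_path: str, months: int = 12) -> int:
    conn = sqlite3.connect(db_path)
    cur = conn.execute(
        "DELETE FROM accounts WHERE last_seen < datetime('now', ?)",
        (f"-{months} months",),
    )
    conn.commit()
    deleted = cur.rowcount  # number of accounts removed
    conn.close()
    return deleted
```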

Going back to the old days of "add a site" is a good idea. I've given up (for now) adding (or updating old) sitemaps to sites: G doesn't seem to care and I've never had a problem with the top-3 crawling anyway.

It's a shame that frames are no longer an acceptable part of the web, since a "spam" button could be permanently displayed in the frame. However, opening each clicked-on SERP result in a new tab or window, which would then need to be closed after use, would return visitors to the index, where a "Report Spam" button could sit against each item. I say "would return..." but this depends to a certain extent on the browser setup.

lexipixel - you mean like DMOZ? An eventually corrupted system that G tried to claim was authoritative despite most people being unable to submit sites to it. An SE needs crawlers to get content anyway. Human editors just could not cope.

lexipixel




msg:4587159
 3:54 pm on Jun 24, 2013 (gmt 0)

lexipixel - you mean like DMOZ? An eventually corrupted system that G tried to claim was authoritative
-dstiles


Sort of, but not exactly. I was a DMOZ editor for several years, so some of my thoughts are based on the ODP model.

What I have in mind is more a network of directory sites with a common data format and some interchange mechanisms.

jmccormac




msg:4588047
 5:17 am on Jun 27, 2013 (gmt 0)

@dstiles
Going back to the old days of "add a site" is a good idea.
It is likely to be heavily spammed.

a "Report Spam" button against each item
This is a major problem with building such a search engine - relying on the user to detect spam. It is not a good way of handling things, because some of the content that can end up in a blind crawl may be illegal.

@lexipixel
What I have in mind is more a network of directory sites with a common data format and some interchange mechanisms.
Something like ODP's RDF format? Again the issue of who makes money from all this arises.

What most people do not see with the web is that it changes from day to day. Thousands of domains drop and thousands more are registered. One of the biggest issues with DMOZ was that it had no ability to self-clean its index. Domains listed in DMOZ were often dropped and re-registered, yet they remained in the directory despite no longer having the original content or owner. This kind of thing also affects SEs.

Regards...jmcc

dstiles




msg:4588286
 7:52 pm on Jun 27, 2013 (gmt 0)

I think "add a site" would be less spammed than auto-find by crawling.

I would only expect a small percentage of users to hit the "spam" button but that's still better than none.

One anti-spam method for both "add a site" and crawl may be to pay attention to the various reports of evil DNS servers and domain name registrars. Both can be discovered to at least a reasonable degree (which is why I do not understand the domain registration agencies' failure to pick up on this).
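One hedged sketch of how submissions might be screened against a locally maintained list of suspect nameservers (this assumes the dnspython package; the blocklist file is hypothetical):

```python
# Reject a submitted domain if any of its NS records match a local suspect list.
import dns.resolver

def load_suspect_ns(path: str = "suspect_nameservers.txt") -> set:
    with open(path) as f:
        return {line.strip().lower().rstrip(".") for line in f if line.strip()}

def uses_suspect_dns(domain: str, suspect_ns: set) -> bool:
    try:
        answers = dns.resolver.resolve(domain, "NS")
    except Exception:
        return True  # treat unresolvable submissions as suspect
    for rr in answers:
        ns = str(rr.target).lower().rstrip(".")
        if any(ns == bad or ns.endswith("." + bad) for bad in suspect_ns):
            return True
    return False
```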

A vast number of the domains registered and dropped each day are registered by criminals and a fair number are used for virus-serving sites for a few days before being rumbled; and some of the more robust ones find their way into search engines. Something else that domain registrars could detect and deflect.

Anti-spam/virus measures on an SE would need someone permanently assigned to the problem; that could be a downer on a start-up.

seoskunk




msg:4588321
 10:13 pm on Jun 27, 2013 (gmt 0)

Lots of great ideas coming through. Please keep them coming.

OK, so my first problem with a new open-source search engine is that it has no data, nothing, so when people come to it they are bound to be disappointed. So I used a Bing script to bring in some results so that we wouldn't have an empty search page. These results come in below the results from our own database.
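Not the actual Bing script, but the ordering described here - own-index results first, external fallback underneath, duplicate URLs dropped - might look something like this:

```python
# Merge results from the engine's own index with an external fallback feed.
def merge_results(own_results: list, fallback_results: list, limit: int = 20) -> list:
    seen = {r["url"] for r in own_results}
    merged = list(own_results)
    for r in fallback_results:
        if r["url"] not in seen:
            merged.append(r)
            seen.add(r["url"])
    return merged[:limit]

# merge_results([{"url": "http://example.org/", "title": "Example"}],
#               [{"url": "http://example.org/", "title": "Example (fallback)"},
#                {"url": "http://example.net/", "title": "Other"}])
```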

Next I needed a crawler - open source. This will eventually need to be replaced with a P2P crawler, but to start with I grabbed a download of Sphider and did a quick crawl of the W3C site.

Next I needed a way to get people to add their sites, and thought of the suggestions about human review and DMOZ. In the end I thought the easiest way would be for people to bookmark them. The sites could then be voted up or down, and those that gained positive votes would then be crawled. So I downloaded SemanticScuttle for this.
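This is not SemanticScuttle's own schema, just a plain illustration of the promotion rule - a bookmarked site only joins the crawl queue once its net votes pass a threshold (the threshold value is arbitrary):

```python
# Promote bookmarked sites to the crawl queue once net votes reach a threshold.
VOTE_THRESHOLD = 3  # arbitrary example value

def sites_to_crawl(bookmarks: dict) -> list:
    """bookmarks maps URL -> (upvotes, downvotes)."""
    return [url for url, (up, down) in bookmarks.items()
            if up - down >= VOTE_THRESHOLD]

# sites_to_crawl({"http://example.org/": (5, 1), "http://spam.example/": (1, 4)})
# -> ["http://example.org/"]
```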

Lots more to come. I have added the URL of the site to my profile. I can't and don't want to do this alone, so please get involved. Anyone out there who can design a logo?

jmccormac




msg:4588344
 12:53 am on Jun 28, 2013 (gmt 0)

I think "add a site" would be less spammed than auto-find by crawling.
The meatbots would be all over it. It is a major issue with web directories, and this would be orders of magnitude larger.

I would only expect a small percentage of users to hit the "spam" button but that's still better than none.
Possibly but then you run into the negative SEO issue of people trying to knock out their competitors.

One anti-spam method for both "add a site" and crawl may be to pay attention to the various reports of evil DNS and domain name registrars. Both can be discovered to at least a reasonable degree (which is why I do not understand the domain registration agencies failure to pick up on this).
Again it is a moving target. The registrars are not in the business of managing the entitlement of each registrant to a domain name - with gTLDs, they are simply in the bulk domain registration business.

A vast number of the domains registered and dropped each day are registered by criminals and a fair number are used for virus-serving sites for a few days before being rumbled; and some of the more robust ones find their way into search engines. Something else that domain registrars could detect and deflect.
I've seen the same claims made before, but they are wrong. The vast majority of domains registered each day are registered by ordinary people and businesses. The five-day window in which a domain can be dropped without payment was abused, but for domain tasting rather than criminal use. If a registrar's five-day deletes go above a certain percentage each month, then it has to pay a percentage of the registration fee. This has significantly reduced the problem. There is also a development curve from the registration date of a domain to a fully functional website appearing (if ever) on that domain name.

Anti-spam/virus measures on an SE would need someone permanently assigned to the problem; that could be a downer on a start-up.
If the SE uses the moronic GIGO approach used by Google and the other major players, it would need more than a single person. It is a continually changing threat environment and a link that might have been good yesterday could be hacked today and carrying a malware payload.

Auto-find, otherwise known as blind crawling, is a very inefficient way of finding new websites. It is also junk-prone. However, the real issue is that, due to the FUD spread by Google and its cargo-cult SEOs, the link structure of the web is decaying. Sites no longer link heavily to each other, which makes it far harder to find new sites. Reciprocal links, especially at the index-page level, are becoming rarer.

Regards...jmcc

dstiles




msg:4588587
 7:50 pm on Jun 28, 2013 (gmt 0)

> The meatbots would be all over it

There are ways of detecting (most?) auto-submission agents, same as detecting auto-scrapers.
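Two of the common tells, sketched as a submission check - a hidden honeypot field that humans leave blank, and a form completed implausibly fast (the field name and the 3-second floor are arbitrary examples):

```python
# Flag a submission as likely automated based on two simple heuristics.
import time

def looks_automated(form: dict, form_rendered_at: float) -> bool:
    if form.get("website_url_2"):            # hidden honeypot field, humans leave it blank
        return True
    if time.time() - form_rendered_at < 3:   # form filled in under 3 seconds
        return True
    return False
```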

> negative SEO issue

Agreed. Not sure how to avoid that one.

Domain names / DNS - there are indicators in DNS, and certainly some DNS servers are very suspect. I agree it would take a lot of work, but what is the project's aim? To avoid as much spam as possible. Some DNS servers are "obviously" compromised and could be trapped.

I saw some stats that gave the number of criminal domains registered per day and was very surprised.

seoskunk - make sure to set a unique user-agent string with a URL pointing to the bot page of the "site", even if the real SE has no existence as yet. For a new bot there should be at least a minimum policy set out - "We do not sell on" etc.
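In practice that advice might look like this, assuming the requests library; the bot name and info URL are placeholders, not a real crawler identity:

```python
# Fetch pages with a unique, self-identifying user-agent string.
import requests

BOT_UA = "ExampleP2PBot/0.1 (+http://www.example.org/bot.html)"

def fetch(url: str) -> requests.Response:
    return requests.get(url, headers={"User-Agent": BOT_UA}, timeout=10)
```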

bhukkel




msg:4588594
 8:05 pm on Jun 28, 2013 (gmt 0)

@seoskunk

Before adding sites you could check their WOT rank - see [mywot.com...] - and only add sites with a good reputation. If you don't want adult content you can also check the child safety rating.

Instead of manually adding sites, you could choose to start with Alexa's top 1 million sites.
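A sketch of seeding from that list, assuming the rank,domain CSV Alexa made available has already been downloaded locally as top-1m.csv:

```python
# Load seed domains from a rank,domain CSV file.
import csv

def load_seed_domains(path: str = "top-1m.csv", limit: int = 100000) -> list:
    domains = []
    with open(path, newline="") as f:
        for row in csv.reader(f):   # each row: rank, domain
            domains.append(row[1])
            if len(domains) >= limit:
                break
    return domains
```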

jmccormac




msg:4588657
 1:53 am on Jun 29, 2013 (gmt 0)

There are ways of detecting (most?) auto-submission agents, same as detecting auto-scrapers.
Yes but the meatbots sometimes do manual submissions. The auto-subs are easier to detect.

The negative SEO one is manually intensive, as it would require people to check that each delisting request is legitimate.

Domain names / DNS - there are indicators in DNS and certainly some DNS servers are very suspect. I agree it would take a lot of work but what is the project's aim - to avoid as much spam as possible. Some DNS servers are "obviously" compromised and could be trapped.
This does get back to the "bad neighbourhood" concept and it is certainly a valid one because problem DNSes and website clusters exist. I do a lot of domain / DNS work because of my main website. At the moment, I'm running a full gTLD website IP survey.

I saw some stats that gave the number of criminal domains registered per day and was very surprised.
Again, you have to be very careful about these numbers and their sources. An example was someone claiming, on the basis of being a network administrator or an avid reader of "technology" journalism, that most domains registered and dropped within the five-day window over the last five years or so were registered for spam. This was absolutely wrong, because on some days the complete day's drop was being re-registered by domain tasters. The domains would have websites with PPC advertising; if they made a predetermined amount in those five days they would be retained, and if not they would be dropped again. ICANN was shamed into changing the regulations around the five-day window (where a domain could be dropped without the registrar having to pay the registration fee). The system was corrupted at a registrar level, but what was being done wasn't technically illegal.

The standard drugs/warez/pron sites do exist, but some of these may use existing websites rather than new domains. As I said, you've got to be quite cynical when it comes to "technology" journalism, because a lot of it consists of recycled press releases, and often these press releases come from companies (anti-virus/anti-spam/anti-*) trying to sell something. The real problem, from a search engine developer's point of view, is the number of compromised websites. Too many compromised websites in an index make it a toxic index, and then you have the same problem as Google.

Regards...jmcc

bhukkel




msg:4588791
 5:36 pm on Jun 29, 2013 (gmt 0)

@jmcc

It's outside the scope of this thread, but what kind of IP survey are you doing? I ask because I do a lot of IP/BGP/routing research.

jmccormac




msg:4588846
 12:44 am on Jun 30, 2013 (gmt 0)

@bhukkel Mapping every gTLD website by IP address and then using that data for other research. It sounds simple when it is written like that. :)
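Not the survey code itself, just a minimal sketch of that host-to-IP mapping step: resolving a list of hostnames concurrently and recording the results.

```python
# Resolve hostnames to IP addresses in parallel and collect the mapping.
import socket
from concurrent.futures import ThreadPoolExecutor

def resolve(host: str):
    try:
        return host, socket.gethostbyname(host)
    except socket.gaierror:
        return host, None  # unresolvable host

def map_hosts_to_ips(hosts: list, workers: int = 50) -> dict:
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(resolve, hosts))

# map_hosts_to_ips(["example.org", "example.net"])
```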

Regards...jmcc
