Privacy. Don't save user data for more than 2 weeks.
Quick, simple, clean results.
The idea fits into our upcoming software release, and I personally would appreciate something like that...
You will need starting capital (somewhere between $20k and $100k). Even then, you will need a handful of developers (GNU?) with good reputations and skills, a business plan, references, and, last but not least, something to show that gives an impression of the idea and actually works (a pre-pre-beta?). Then you might reach the point of searching for funds. Today, an idea alone is not enough anymore...
However, I could assist (on a very limited-time basis) with our software products, customer base, newsletters, web properties, and advice & management, if needed.
Thanks SEchecker, your assistance would be much appreciated. Funding-wise, I could put in $25k myself and crowdsource the rest.
|brotherhood of LAN|
I'd suggest a search engine that tries to separate informational queries from commercial ("money") queries. The line would be blurry, but I think the rewards would be worthwhile.
I also think a new venture should go local rather than chase "the big win" of worldwide search. I can't put a number on it, but trying to spider the entirety of the web efficiently and in a timely manner is going to cost a lot in hardware and bandwidth.
I think the best bet is to collect some starting funds first (not cash, of course)... If you can reach $100k, that would be a good sign to move forward > make a business plan > find developers (offer shares, or make it a GNU open-source project) > create something to present the idea > search for real funds (approx. $250k and up) > get REAL and start :-)
"spider the entirety of the web efficiently and in a timely manner is going to cost a lot in hardware & bandwidth."
I don't think this is even possible for a start-up; that seems like fantasy. You will need to work with existing sources. Building up your own index and operating it would cost you billions... Maybe you could rent capacity at the beginning, but that's not smart security-wise, and the costs would eat up all your funds in a second.
Wouldn't P2P eliminate the hardware costs?
Can you specify what you mean by a p2p search engine, please (what you have in mind)?
Somewhere you will need to store the index...
A P2P search engine would distribute tasks and crawls among its network, with the index partitioned among the peers.
If the network has 5 users, where is the whole Internet index? I cannot follow you. For example: file sharing requires the file that a user requests to be stored on some device connected to the file-sharing app. The same would apply to the Internet index (database)...
Or I don't get the p2p idea at all :-)
True, bits of the index would have to be stored on the peers' computers, and those would have to be online for the database to work. The more people involved, the more stable the index would become.
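To make the partitioning idea concrete, here is a hypothetical sketch (not from anyone in this thread) of how a P2P engine might shard an inverted index: hash each search term to one of the peers, so every peer stores only the posting lists for "its" terms. The peer names are made up.

```python
# Sketch: shard an inverted index across peers by hashing terms.
import hashlib

def peer_for_term(term, peers):
    """Map a term to a peer with simple hash partitioning.

    Note: modulo hashing reshuffles almost everything when peers
    join or leave; a real P2P network would use consistent hashing
    or a DHT such as Kademlia to tolerate churn.
    """
    digest = hashlib.sha256(term.encode("utf-8")).hexdigest()
    return peers[int(digest, 16) % len(peers)]

peers = ["peer-a", "peer-b", "peer-c", "peer-d", "peer-e"]
shards = {t: peer_for_term(t, peers) for t in ["search", "engine", "p2p"]}
```

Any node that knows the peer list can route a query for a term to the responsible peer without asking a central server, which is the point of the scheme.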
I need to think on that; it's late here, so tomorrow :-)
Here's a simple idea: take Google's results and flip them upside down.
Essentially, if I'm not wrong, what you mean by a P2P search engine is relying on user computers rather than on a central repository. You will still need a central repository to store all your crawling information; alternatively, if you categorize information and distribute it to peers, that may help serve queries faster geographically.
Personally, I think a P2P search engine is useful if you have a central repository stored across multiple servers (like Google), you have cached common search terms, and the cached results are distributed among peers.
I hope it will help you!
No adverts (or at least make the organic top results the main priority), and no annoying fat nerd making videos.
Maybe even a super reward for the top 3 sites, as in:
- top 3 sites
- then some small ads (obviously these need to be in another color, with a label saying they are ads)
- and then the rest of the results
There are plenty of SEs out there. I built an index of 100M pages running on 5 midrange servers (crawling, serving, etc.), but the project fell apart due to lack of funding. The results served were indeed better than what Google is providing.
Getting a good prototype pre-production engine in place is realistically in the $200k range: figure $100k or so on machines, bandwidth, etc., and another $100k on programming and labor.
How to monetize search is the challenge. Today everyone is doing a "me too" bit with highest-bid PPC, with no incentive for CUSTOMERS, and without ad buyers you have no model. I think a flat-rate ad model would work, as legacy PPC is just a margin eater, a tax of sorts that scales with your margin; it is not friendly to the bottom line, and it is chaotic in the sense that your ROI can change by the second.
If a marketer could buy 5 cent clicks at XYZ engine, that marketer would tell people to use XYZ engine and you really need that word of mouth.
Offering a better result set than Google? That's easy peasy; they deliver junk and hide the goodies. But you really need to offer something compelling to get the buzz moving. Compelling to me is a product that enhances my bottom line and restores sanity to my marketing budget.
BTW, you don't need a 100-billion-page index out of the box; a quality 100-million-page index will answer 95 percent of queries, and you fill in the blanks from there. It's not that hard.
Perhaps you can join this project.
YaCy is a bit of a geek's toy; I don't see it as something that would become widely adopted. I do think it's pretty cool, though. I love to see others out there cracking away at search.
I've thought long and hard on this subject, and the challenges for a small start-up with a very limited budget make such a project practically unachievable. 15 years ago it would have been a lot easier. Crawling the web and keeping that content fresh is a huge task in and of itself.
Rather than try to be the next Google, why not aggregate the web (like the Internet Archive) using bots and other sources of data, then lease that data back to other small start-ups? It could be an open-source project of some sort. This would allow search engine start-ups to focus their entire budgets on search algorithms (instead of spending costly and time-consuming resources on crawling). Eventually, every site on the web could have its own custom refined search engine, and there would be no need for Google.
Mj12 is a p2p search engine?
My choice for default SE is DuckDuckGo
Do NOT use distributed crawling (as per mj12). Webmasters cannot easily determine who a crawler belongs to unless there is a reverse DNS lookup that returns a suitable indicator AND there is a web page that preferably defines what IPs are used in crawling and certainly what the User-Agent is. There must also be a reassurance page ("We do not sell your site content" etc).
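The reverse-DNS round trip described above is the verification procedure the major engines document for their own crawlers: reverse-resolve the IP, check the hostname's domain, then forward-resolve the hostname and confirm it maps back to the same IP. A minimal sketch (the Googlebot hostname pattern here is just an example of the shape of such names):

```python
# Sketch: verify a claimed crawler IP via the reverse/forward DNS round trip.
import socket

def hostname_in_domain(hostname, domain):
    """True if hostname is the domain itself or a subdomain of it.

    A plain endswith() check would wrongly accept e.g.
    'evilgooglebot.com' for 'googlebot.com'.
    """
    hostname = hostname.rstrip(".").lower()
    domain = domain.rstrip(".").lower()
    return hostname == domain or hostname.endswith("." + domain)

def verify_crawler(ip, domain):
    """Reverse-resolve ip, check the domain, then forward-resolve
    the hostname and confirm the round trip succeeds."""
    try:
        hostname = socket.gethostbyaddr(ip)[0]
    except OSError:
        return False
    if not hostname_in_domain(hostname, domain):
        return False
    try:
        forward_ips = socket.gethostbyname_ex(hostname)[2]
    except OSError:
        return False
    return ip in forward_ips
```

A webmaster would call something like `verify_crawler("66.249.66.1", "googlebot.com")` when a suspicious User-Agent shows up; any crawler operator who wants to be allowed through should publish the domain to check against.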
If you try to use distributed crawling, you will lose websites with security-conscious hosts and/or webmasters.
If you do not reassure webmasters and hosters, they will block your crawlers; at least, the security-conscious ones will. There are SO MANY bots trying to crawl and scrape site content that it's ludicrous. If I enabled every bot that hits my server, even excluding the obvious criminal scrapers, I would have something like ten times the number of hits of interested customers.
Before starting any crawl service read the past year (at least) of [webmasterworld.com...]
As to the actual SERPS: it would be necessary to be very good at discerning scraper and spam sites. Is anyone here good enough?
Hey, great topic, I've been thinking about this a lot:
It has to allow for error, not try to be perfect.
It should use a 'ledger' system to create a secure, 'untamperable' audit of everything, similar to Bitcoin. The index and user base would be stored in this, completely transparent and readable. (Personal security is not affected, because users are just 'GUIDs'.)
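The tamper-evidence in that kind of ledger comes from hash-chaining: each entry commits to the previous entry's hash, so altering any record breaks every link after it. A toy sketch, far simpler than Bitcoin (no proof of work, no distribution), just to show the chaining idea:

```python
# Sketch: a hash-chained ledger where tampering is detectable.
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash for the first entry

def _entry_hash(prev_hash, record):
    body = json.dumps({"prev": prev_hash, "record": record}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()

def append_entry(ledger, record):
    """Append a record, committing to the previous entry's hash."""
    prev = ledger[-1]["hash"] if ledger else GENESIS
    ledger.append({"prev": prev, "record": record,
                   "hash": _entry_hash(prev, record)})

def verify(ledger):
    """Recompute every hash; any edited record breaks the chain."""
    prev = GENESIS
    for entry in ledger:
        if entry["prev"] != prev or entry["hash"] != _entry_hash(prev, entry["record"]):
            return False
        prev = entry["hash"]
    return True
```

In the search-engine context the records would be index updates and votes keyed by user GUIDs; any node replaying the ledger can detect a retroactive edit.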
Distributed crawling would have to be done carefully.
The reverse index would unfortunately need to be stored on every node. (The index wouldn't be as big as you think, but this part definitely means it's a project that isn't quite ready for prime time this year... maybe in 5 years.)
I would prefer mass-volume user ranking above everything else, algorithm-wise. (Good outnumbers SPAM, so the larger the user base, the more accurate that signal.) I could harp on about why for hours.
Very simple algos; the community has to learn from the mistakes Google has made. Vote signals rule above everything, but basic algos help early on, and in general they give a starting pattern for new SERPs.
No more random crawling: you have a sitemap and you register your site, or we don't index it. Any site not doing that can probably be ignored these days.
Prefer positive user signals over negative ones (by a considerable margin). Possibly positive-rating and [SPAM warn] buttons only; no negative rating. Mainly because people voting 'negative' tend to do it for weirder reasons than people who are 'satisfied' in some way; the 'positive' signal is stronger.
SPAM will exist, but user feedback (as the user base grows) would eventually make it difficult. Unlike what we have now, the number of genuine users would far outnumber the number of BHs faking users. BH would be almost impossible like this, permanently. They could try to rig voting, but again, the user base would be an order of magnitude greater than anything a single group or forum could muster up.
Users can search completely anonymously, but are unable to vote like that.
User = GUID = does not need to be tied to a name. (Same as Bitcoin)
You can create any number of accounts automatically, a million if you like. We base our 'anti-BH' techniques on the fact that we accept this.
Perhaps some IP restriction on volume account creation. That would be a little 'gray' but may be sensible to avoid bloating the ledger with BH attempts.
Users' vote power increases as time passes and as votes are cast by an account that the network accepts. Again, people spamming votes with a million fresh accounts would be up against a wall with that.
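One way such a scheme could be shaped: weight each vote by account age and by how many of the account's past votes the network has accepted, so a flood of fresh sock-puppet accounts carries almost no weight. The curve and constants below are entirely made up for illustration.

```python
# Sketch: vote weight grows with account age and accepted-vote history.
import math

def vote_weight(account_age_days, accepted_votes,
                half_life_days=90, vote_scale=50):
    """Return a weight in [0, 1).

    age_factor rises toward 1 as the account matures; trust_factor
    rises toward 1 as the network accepts more of its votes. A
    brand-new account (age 0, no history) gets weight 0.
    """
    age_factor = 1.0 - math.exp(-account_age_days / half_life_days)
    trust_factor = accepted_votes / (accepted_votes + vote_scale)
    return age_factor * trust_factor
```

Because both factors saturate, even an ancient, prolific account never dominates; a million day-old accounts contribute roughly nothing, which is the property the post is after.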
Open source, obviously. Community of developers, and community of experts make and vote on changes over time. Anybody can fork, anybody with a good idea should be able to prove it works.
Public nodes allow for the system to be used as a web-service. Anybody is able to do that, if they modify the results people can decide whether they want to use that 'service' or not.
Non-crawling nodes for those on networks that would have a problem with that.
Crawling in general would be the best place to inject spam, so new sites have to be registered with the network first ("add a new site"), at which point they are sandboxed for a short period, slowing down some BH (which wouldn't work anyway). Then the nodes can begin crawling.
A method for segmenting crawling tasks would be needed: probably a prior scan of the sitemap, then a series of requests sent to random nodes to perform. Results would be audited against user accounts, which could be used to detect and lock out bad accounts if needed.
Nodes cannot request these tasks; each task has to be assigned to a random node associated with a known user (reducing the chance of a BH node getting a task it wants).
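A sketch of that push-assignment idea, with hypothetical node IDs: the coordinator picks several nodes at random for the same URL, so no node can volunteer for a task it wants, and disagreeing results flag a node for audit.

```python
# Sketch: assign a crawl task to randomly chosen known nodes.
import random

def assign_task(url, known_nodes, replicas=3):
    """Assign one URL to several independently chosen nodes.

    SystemRandom draws from the OS entropy source rather than a
    seedable PRNG, so nodes cannot predict or steer the picks.
    Multiple replicas let the network cross-check the crawled
    results and lock out nodes that return bad data.
    """
    rng = random.SystemRandom()
    picks = rng.sample(known_nodes, k=min(replicas, len(known_nodes)))
    return {"url": url, "assigned_to": picks}

nodes = ["node-%d" % i for i in range(10)]
task = assign_task("http://example.com/sitemap.xml", nodes)
```

In a real network the "coordinator" role would itself have to be decentralized (e.g. derived deterministically from the ledger), otherwise it becomes the thing BHs attack.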
That's all I can think of for now, totally immature process - just a start.
Great comments above. Something to think about: I am not worried about the NSA; I am more worried about what many providers already have on everyone. Think about how much data we have on companies with third-party tools, etc. We just don't know the exact person, so we study data sets of information that could maybe lead to a visitor ordering our product. Then we have tools to follow up with them a few weeks later with an email about the bounce they triggered. Of course we don't tell them that; instead our shopping cart says "you may have had an issue during checkout and we just wanted to follow up"...
Think about how much data you are responsible for as an analyst.
I have no skills with which to help anyone create a new search engine. But as someone who's been searching since the 90s and is now having more trouble than ever finding what she's after, I would love to see an alternative engine that guides me through a few questions to hone my results. I put in a term, and it asks me if I'm shopping, looking for information, etc. It could have preference settings, like whether I want big-brand sites or more off-the-beaten-track sites. It could offer easy ways to eliminate certain words/types of sites from a query (basically functioning like the "-widget" operator that most searchers have never heard of).
If it did this, it would NOT be competing directly with Google, which tries but often fails to guess what you're looking for. This alt SE would admit right up front that it needs some context, and it could be marketed as the engine that takes a little input from you to give you truly customized results.
I'd want an index that was complete (very little pre-filtered), yet somehow safe to use, and my ability to customize it as a searcher to be as absolute as possible. Not an AI search engine that second-guessed me (or at least, not one exclusively), but rather one that made ME more intelligent. Ideally, I could choose from a bunch of preset engines (from customizable to ones with AI components), or customize my own entirely in a user-friendly way. So it would have virtually an infinite number of logical commands to be used if desired, but also modes where I could sit back and say, "OK, do my thinking - whadya got for me?"
No crawlers. 100% Human edited.
Crawlers just pick up the content of the page (to be put into the reverse index).
Algorithms enable content promotion/demotion. I agree that algos need to play a very small role; I prefer 90% human feedback (until we have full AI, but that's a different story for another, bigger conversation).
@diberry - yeah, that's an interesting take - a simple manual context selector. Also the idea of full searcher customisation / query filtering - which would be cool.
The whole thing is a pipe dream, but then a lot of great things were. We'll see this project live one day.
Make it Open Source - keep the greed & corporate profits out of the algo.