Creating alternative to Google serps

Forum Moderators: Robert Charlton & goodroi

EditorialGuy

6:50 pm on Jul 23, 2014 (gmt 0)

System: The following 6 messages were cut out of thread at: http://www.webmasterworld.com/google/4690067.htm [webmasterworld.com] by goodroi - 9:17 pm on Jul 23, 2014 (utc -5)

Need in-depth research information about a product? Google won't provide it so searchers have to go elsewhere.

Yes, and in most cases, they go elsewhere by clicking on Google search results.

webcentric

3:09 am on Jul 29, 2014 (gmt 0)

There's a lot of talk in the thread of how things should be done after the search engine is built but very little discussion on how to create an index or build the search engine.

Spent today digging deeply into inverted indexes, forward indexes etc. and, while search engine design is not a simple subject, I'm of a belief that it doesn't have to be as complicated as the scale and scope inherent in the major engines dictates.

Full-text indexing and maintaining a large, constantly changing inverted index can present a host of challenges, but scaling down and being rather selective in what you index could somewhat minimize the downsides associated with rebuilding the index when the underlying data changes. Small engines could do this in a matter of minutes during off hours without causing too much disruption to the end-user experience. I could see this happening once a week or less if you're really hand-picking your data.

If you're crawling billions of pages like Google, then you pretty much need to be constantly maintaining the index and it's gonna take a lot of time and server resources to do it. In a smaller context, updates can be batched during off hours to greatly minimize the impact on the end user side. So, my first point is about scale. Engineering the next Google is one thing; engineering a smaller niche engine based on a selective body of content is another. Both will benefit from careful engineering but, if quality is used as a scoping factor, you may be able to avoid some of the issues associated with mega-indexes. Who says you have to index everything to be useful and relevant? Certainly not me.

Do I have to be fair? No. It's my index. If I want to let people know about your content, lucky you. Enjoy the traffic.

I just built an inverted index for the King James Bible in less than 30 seconds. Translate this to a couple hundred thousand rows of page data (which would make a pretty decent library of information on a given subject) and we've got the foundation of a fairly manageable index.
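For anyone who wants to see how little code the basic idea takes, here is a minimal Python sketch of an inverted index. It is not the code used for the King James Bible run mentioned above; the function names (`tokenize`, `build_inverted_index`, `search`) and the three sample verses are illustrative assumptions, and a real index would also store positions, counts and so on.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def build_inverted_index(rows):
    """Map each word to the set of row ids whose text contains it."""
    index = defaultdict(set)
    for row_id, text in rows.items():
        for word in tokenize(text):
            index[word].add(row_id)
    return index

def search(index, query):
    """Return the ids of rows that contain every word in the query (AND search)."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical sample data: three verses standing in for "rows of page data".
rows = {
    "Gen 1:1": "In the beginning God created the heaven and the earth.",
    "Gen 1:3": "And God said, Let there be light: and there was light.",
    "John 1:1": "In the beginning was the Word, and the Word was with God, and the Word was God.",
}
index = build_inverted_index(rows)
print(sorted(search(index, "beginning god")))  # ['Gen 1:1', 'John 1:1']
```

Rebuilding is just a matter of running `build_inverted_index` again over the current rows, which is why a small, hand-picked corpus can be re-indexed in a batch during off hours.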

webcentric

3:54 am on Jul 29, 2014 (gmt 0)

@iammeiamfree -- I think finding pages to index is the least difficult problem to solve in some ways. The real issue is qualifying the content for inclusion in the index and subsequently ranking it. With a good set of quality rules in place as a sort of first line of filtering, you're left with how to rank the content you do want to index.

Finding pages to index (or consider for indexing) can be done in a variety of ways including crawling, webmaster submissions, scraping your way through various SE query results etc. Even the basic concepts of traffic exchange, email invites etc. that you mention could be effective. How you analyze their content is the bigger question in my mind. And of course, ranking is where the attempts at manipulation occur.

Hmm, maybe throw a randomize option into the search, e.g. "Show me 10 random results for 'How to trim a cat's toenails'". If every result returns a really good answer to the question, who cares. I'm thinking that if you make getting into the index at all the real Holy Grail (rather than where you rank for a particular keyword/phrase), then some of the reasons for manipulation can be removed. Pie in the sky, I know. Just throwing out some late night ideas for consideration.
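A rough Python sketch of that randomize option, assuming the index has already returned a list of vetted matching pages. The function name `random_results` and the example URLs are made up for illustration.

```python
import random

def random_results(matching_pages, k=10, rng=None):
    """Return up to k matching pages in random order, with no ranking applied."""
    rng = rng or random.Random()
    pages = list(matching_pages)
    return rng.sample(pages, min(k, len(pages)))

# Hypothetical set of pages that already passed the quality rules for this query.
vetted = [f"https://cats.example/trimming-tip-{i}" for i in range(25)]
print(random_results(vetted, k=10))
```

The selection only works as a user experience if everything in `vetted` is a good answer, which puts all the weight on the inclusion rules rather than on position.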

Shaddows

7:32 am on Jul 29, 2014 (gmt 0)

Do I have to be fair? No. It's my index. If I want to let people know about your content, lucky you. Enjoy the traffic.

Google should just replace their Guidelines with that statement.

jmccormac

10:37 am on Jul 29, 2014 (gmt 0)

Finding pages to index (or consider for indexing) can be done in a variety of ways including crawling, webmaster submissions, scraping your way through various SE query results etc.

It is a lot easier than that. However the problem is sorting out the gold from the dross.

When people look at the numbers of domains registered in a TLD, they often have no idea of the number of active websites in that TLD. Some of the numbers posted by registries are untrustworthy at best (iffy "surveys" from consultancies of even iffier expertise) or woefully naive surveys by non-experts on small sets of domains. With large TLDs such as .COM (113.7M domains), surveying the whole TLD (or spidering it) is a very complex task. This is a 110K domain survey of .COM domains and their websites broken down by usage and development category that I ran earlier this month: [hosterstats.com...]

Basically the number of active websites in a TLD is almost always lower than the number of domains registered in that TLD, and the reason that .COM is so well developed is down to age and usage. It is the default global TLD and can represent about 40% or so of domains registered in particular countries. However as a global TLD, there's no real geographical differentiation so that one might be crawling a site from China followed by one from Canada. But the main thing would be to exclude the noise when crawling and that means concentrating only on the required sites (active content) and dropping everything else (PPC pages, holding pages, clone sites, duplicate content, in zone redirects etc).

If one imagines the web as a city, then there are going to be a few sites which resemble skyscrapers in terms of content and they will have hundreds of thousands or millions of pages. There will also be large multistory buildings with thousands or hundreds of pages. Then moving outwards from the city centre, the suburbs are filled with sites with a few hundred pages of content. There are also the shed sites with less than a hundred or so pages of content. And all these web properties are in various states of development and or decay.

Every month, I run web usage surveys on hundreds of thousands of websites. Some months, over a million websites are surveyed. The interesting thing is that not all websites are actively developed or maintained. Approximately 23% of sites are not updated or changed within a year. Many of these sites are brochureware sites. Then there are the abandoned sites (sites where development has ceased). The rate of abandonment is a very important metric in web usage surveys and it is one of the indications of the health of a TLD. Some abandonment happens early in the lifecycle of a TLD when people register a domain, put up a copy of Wordpress or Joomla and then find out that the discipline of writing is a lot tougher than liking a page on Facebook. Again there are indicators in the site (a single 'hello world' post) and post dates that can be used to identify these sites. There are also compromised websites that have either been defaced or have suffered link injection hacks with dodgy links that are only visible to search engines. This is a tricky problem but it can be solved quite easily. (The people in Google aren't quite the calibre necessary when it comes to solving it and that's why it has such massive problems with payday loans, counterfeit goods, drugs etc.) Keeping the index clean is an important issue but there has to be an index first.
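To make the indicators above concrete, here is a small Python sketch that labels a site from two of the signals mentioned: the date of the last change and a lone 'hello world' post. It is only an illustration, not the method used in the surveys; the function name `classify_site`, the labels and the 365-day threshold are assumptions.

```python
from datetime import date, timedelta

def classify_site(post_titles, last_modified, today):
    """Give a rough activity label based on post titles and the last change date."""
    stale = (today - last_modified) > timedelta(days=365)
    # A fresh blog install typically leaves one default "Hello world!" post.
    only_hello = (
        len(post_titles) == 1
        and post_titles[0].strip().lower().rstrip("!") == "hello world"
    )
    if only_hello and stale:
        return "abandoned"
    if stale:
        return "not updated within a year"
    return "active"

today = date(2014, 7, 29)
print(classify_site(["Hello world!"], date(2012, 3, 1), today))    # abandoned
print(classify_site(["Our services"], date(2013, 1, 15), today))   # not updated within a year
print(classify_site(["News", "More news"], date(2014, 7, 1), today))  # active
```

Compromised sites with hidden link injection would need different checks, for example comparing what a crawler sees with what a browser sees, and those are not covered here.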

The most complex part of building any search engine is knowing what not to include. The second most complex part is identifying what you need to include.

Regards...jmcc

iammeiamfree

10:44 am on Jul 29, 2014 (gmt 0)

Do I have to be fair? No. It's my index. If I want to let people know about your content, lucky you. Enjoy the traffic.

I just built an inverted index for the King James Bible in less than 30 seconds. Translate this to a couple hundred thousand rows of page data (which would make a pretty decent library of information on a given subject) and we've got the foundation of a fairly manageable index.

Yeah so you could create an index for King James Bible and that would be a table in your database. All the global search engine would need is an entry for King James Bible index, the age of the index, how recently it has been updated and a quality score derived from webmasters in the niche and related niches. Then when a user makes a query for King James Bible + Genesis the engine looks for the row King James Bible and then goes across to your database and returns results for Genesis from your index.

If there is a query for bible in cornish and we don't yet have a row for that in the search engine then for the time being we can just redirect to google search?q=bible+in+cornish but we record the query and make the information available to webmasters in bible niche.
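As a sketch of how that routing could look in Python: a registry row per mini index, a lookup that hands the query to the right index, and a fallback that redirects while logging the query nobody could serve. The `Registry` class, its method names and the full Google URL form are assumptions for illustration, and a plain function stands in for the call across to the webmaster's own database.

```python
from urllib.parse import quote_plus

class Registry:
    """Global table of mini indexes, keyed by topic name."""

    def __init__(self):
        self.indexes = {}   # topic -> function that searches that mini index
        self.unserved = []  # queries we had no row for, to share with webmasters

    def register(self, topic, search_fn):
        self.indexes[topic.lower()] = search_fn

    def route(self, topic, terms=""):
        search_fn = self.indexes.get(topic.lower())
        if search_fn is None:
            # No row yet: record the query and send the user elsewhere for now.
            query = f"{topic} {terms}".strip()
            self.unserved.append(query)
            return {"redirect": "https://www.google.com/search?q=" + quote_plus(query)}
        return {"results": search_fn(terms)}

registry = Registry()
# Hypothetical mini index: a real one would query the webmaster's database.
registry.register("King James Bible", lambda terms: [f"KJV passages matching '{terms}'"])

print(registry.route("King James Bible", "Genesis"))
print(registry.route("bible in cornish"))
print(registry.unserved)
```

The `unserved` list is the part that lets webmasters in a niche see demand that no mini index covers yet.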

If a spammer makes a viagra index called king james bible you are informed and review the index and mark it as spam. Other sites also realise it is spam and it gets removed. If another site also wants to make an index of king james bible they can either import your data or make their own. If it is good then the webmasters will mark it to use for returning results for those queries.

When you login to your mini search engine you can see what people have been searching for and go thru and choose the best mini indexes to use or import the data from the best of those and rework it according to your editorial rules. Alternatively you could build one from scratch by scraping pages or search engines but you hand select the results to be included.

If a webmaster tries to build too many and too large mini indexes, those indexes are going to fall afoul of the quality ratings of the other webmasters, so the best approach is to only work on building and maintaining small, precise and manageable indexes around one's niche.

The editorial preferences of a group of webmasters in and relating to a niche are balanced together to choose an overall ranking for sites.
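One simple way to picture that balancing, sketched in Python: average the scores each site gets from the webmasters in the niche and drop anything a majority has marked as spam. The averaging rule, the majority threshold, the 0 to 5 scale and the name `rank_sites` are all assumptions; other schemes (medians, weighting by a webmaster's own reputation) would fit the same description.

```python
from statistics import mean

def rank_sites(ratings, spam_votes, total_voters):
    """Rank sites by mean rating, skipping sites a majority flagged as spam.

    ratings: site -> non-empty list of scores (0 to 5) from webmasters.
    spam_votes: site -> number of webmasters who marked it as spam.
    """
    ranked = []
    for site, scores in ratings.items():
        if spam_votes.get(site, 0) * 2 > total_voters:
            continue  # removed: more than half of the voters called it spam
        ranked.append((mean(scores), site))
    ranked.sort(reverse=True)
    return [site for _, site in ranked]

# Hypothetical ratings from three webmasters in the bible niche.
ratings = {
    "kjv.example": [5, 4, 5],
    "pills.example": [5],
    "notes.example": [3, 4, 3],
}
spam_votes = {"pills.example": 3}
print(rank_sites(ratings, spam_votes, total_voters=3))  # ['kjv.example', 'notes.example']
```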

We leave google to deal with the spy agency queries like ukraine results for the last hour etc. and focus on high quality mini indexes to keep visitors engaged and browsing our network.

gduffield

11:23 am on Jul 29, 2014 (gmt 0)

Wow...funny I was just telling my wife 2 weeks ago, there's going to come a time when we look back and remember how big Google used to be. I'm not saying the beginning of that is happening here and now in this thread, but I do believe it will happen. I could be wrong, but in the next 10 to 15 years we will look back on Google like you look back on Yahoo, and I believe they will be just about as relevant.

Even people I talk to that have nothing to do with websites or SEO spout off about how much they are starting to hate Google. Not only for what they deliver in search results, which is still heavily filled with ads and links to spam sites, but also just that they are getting so intrusive...and big. Their arrogance knows no bounds, and I can sense people are starting to feel that.

I personally believe the market is quickly becoming primed for a fresh, easy to use alternative, that guarantees privacy with no tracking and complete anonymity. While there may be some alternatives out there, it will take greater genius to promote it properly than it would to come up with the actual indexing algos. You would need to come up with creative incentives to get people to use this new engine and shift their habits, but there are ways to do that.

iammeiamfree

12:42 pm on Jul 29, 2014 (gmt 0)

The approach of involving webmasters in creating and maintaining the search engine implicitly offers a huge marketing advantage.

The sites involved have search functionality incorporated on their sites as well as resources and related links content on pages around their sites.

Participating sites will be able to provide visitors with an enhanced user experience. Users can use our sites as a starting point to explore the topic in detail for hours or even days.

Privacy could be a key selling point. The engine would be separated out into mini indexes maintained by the distinct webmasters and no user identifiable data need pass between sites or even be recorded by the sites.

jmccormac

12:43 pm on Jul 29, 2014 (gmt 0)

If a spammer makes a viagra index called king james bible you are informed and review the index and mark it as spam.

Spam as the result of compromised websites would probably be more common in this scenario than genuine spam websites. The genuine spam websites tend, over the last few years, to have links from the compromised websites.

Regards...jmcc

iammeiamfree

2:16 pm on Jul 29, 2014 (gmt 0)

Version 1.0 can just be the tools for starting to build the mini indexes and adding resources content on our sites. The option of rewarding our top referrers helps encourage cooperation.

Other webmasters notice incoming referral traffic and join us.

Once the search engine is set up we reward the best maintained indexes. When you login to your search site you can check some of the oldest sites in your index in case they are compromised. Should there be a problem, other webmasters with the compromised site in their index are informed so they can check or remove that site, and we inform the webmaster of the compromised site.

SEO has taken on a whole new meaning now. Soon webmasters' traffic stats will be hot with our referral traffic and they will be joining us in droves.

The webmaster really will be optimising search.

EditorialGuy

3:09 pm on Jul 29, 2014 (gmt 0)

I could be wrong, but in the next 10 to 15 years we will look back on Google like you look back on Yahoo, and I believe they will be just about as relevant.

There's one critical difference between Google and Yahoo: Google continues to invest heavily in both its core product (search) and new products, while Yahoo has coasted along on the value of its 1990s brand name.

jmccormac

3:50 pm on Jul 29, 2014 (gmt 0)

There's one critical difference between Google and Yahoo: Google continues to invest heavily in both its core product (search) and new products, while Yahoo has coasted along on the value of its 1990s brand name.

Nah. Google buys a lot of its products because it hasn't the people or the ideas to develop them. Almost everything it has done for the last few years has been poorly executed, in many cases, me-too derivative stuff. (Google Plus, Orkut, Buzz etc.) Google is in trouble with search and it has become an advertising company where search is merely an outlet for its advertising service. Developing a product or service takes innovators and entrepreneurs rather than the joiners that just work for companies. Without that spark of innovation, that original idea, all you get is the me-too dross. It works for a while because many "technology" journalists haven't a clue about technology or the business of technology so they rely on press releases for their "knowledge". But sooner rather than later, those me-too businesses crash and burn. Then they are quietly shuttered and their employees are either fired or shuffled elsewhere in the corporation.

This is the Wikipedia list of discontinued Google products and services:
[en.wikipedia.org...]

Yahoo ran into the same problems that Google is facing now - the wrong people making the wrong decisions. When Facebook went for stockmarket flotation Google rolled out its exercise in braindead plagiarism, its "knowledge" graph (a Wikipedia scraper). Well almost everyone saw that Social Network movie and people were beginning to talk about the Social Network graph. So some PR flack in Google rolls out a bogus Star Trek:TOS song and dance to go along with the Scraper Graph. (Never mind that Apple's Siri had essentially implemented the Star Trek ship's computer idea complete with the pleasant female voice.) Godaddy, the largest registrar and domain hoster in the industry goes for flotation so along comes Google with its "Google Domains" registrar to get in on the action. Google really needs a Steve Jobs of Search but instead all it has is a bunch of soda pop salespeople.

An alternative to Google's SERPs has to be better than Google, more precisely targeted than Google and give the people what they want. It also has to be beneficial for webmasters in that they get traffic for their content and are not ripped off by having their sites massacred by the algorithmic brainfart of some individuals trying to repair the damage created by other individuals. Seemingly none of these individuals ever built any original content based website of worth and wouldn't know a decent site if it slapped them on the side of the head with a trout. Get that part right and the traffic to this hypothetical search engine will rapidly gain critical mass. And maybe, in the process, it will get to LART Google with a hypothetical trout. :)

Regards...jmcc

EditorialGuy

4:17 pm on Jul 29, 2014 (gmt 0)

An alternative to Google's SERPs has to be better than Google, more precisely targeted than Google and give the people what they want.

Easy to say. Tougher to do.

It also has to be beneficial for webmasters in that they get traffic for their content and are not ripped off by having their sites massacred by the algorithmic brainfart of some individuals trying to repair the damage created by other individuals.

To succeed, a search engine has to benefit searchers.

Getting back to the "10 to 15 years from now" prediction:

I think we'll see greater differentiation between information and commerce than we do now, for a start. (Kind of like the difference between the White Pages and the Yellow Pages in the heyday of phone directories.) This would benefit searchers (better results), the search engines (higher ad revenues), and businesses that were looking to acquire customers, not just immediate sales.

CaptainSalad2

4:33 pm on Jul 29, 2014 (gmt 0)

I'm not saying the beginning of that is happening here and now in this thread, but I do believe it will happen.

It would be hilarious if there was an online open source search engine revolution that decimated Google in 10-15 years and it started here, in this forum, in this thread and it was traced back.......

The thread says "started by Editorial Guy". Always the one you least expect lol ;)

EditorialGuy

4:51 pm on Jul 29, 2014 (gmt 0)

The thread says "started by Editorial Guy". Always the one you least expect lol ;)

Actually, it was started by Goodroi, the moderator. You've got to read the fine print. :-)

jmccormac

4:53 pm on Jul 29, 2014 (gmt 0)

It would be hilarious if there was an online open source search engine revolution that decimated Google in 10-15 years and it started here, in this forum, in this thread and it was traced back.......

The thread says "started by Editorial Guy". Always the one you least expect lol ;)

The irony would be delicious. :)

Regards...jmcc

webcentric

5:01 pm on Jul 29, 2014 (gmt 0)

No one reads the fine print. The first post has EG's moniker on it and that's what people searching for the history of Google's downfall and the rise of webmaster-driven, open source search will see 15 years from now when they find this thread. Ironic indeed. ;)

brotherhood of LAN

5:10 pm on Jul 29, 2014 (gmt 0)

At least for me, any new engine would only have to cover the English language. For the 1 in 500 searches I need a translation, hopefully there'd be a translate function to use on a different language engine.

jmccormac,

What do you think is the best way forward, another global player or lots of niche/location/language specific ones?

jmccormac

5:15 pm on Jul 29, 2014 (gmt 0)

If it happens then dear leader Larry and perhaps Sergey will be working on a Google Terminator to go back in time to eradicate this thread and stop EG from triggering the end of Google. :)

Maybe it is worth looking at the problem again and coming up with a solution.

Regards...jmcc

jmccormac

5:26 pm on Jul 29, 2014 (gmt 0)

What do you think is the best way forward, another global player or lots of niche/location/language specific ones?

Universal search died in 2004, Brotherhood of LAN. The reason was that Wikipedia and the rise of Social Media killed it. For the last ten years or so, school kids and students have been using Wikipedia rather than Google. Even Google's Scraper Graph is a grudging acknowledgement of this fact. The future, I think, is a spectrum of search with a lot of niche search engines that may be accessed via a common interface or via their own interface. The main issue for this renaissance of search would be that each component search engine would have a high quality index.

Regards...jmcc

webcentric

5:43 pm on Jul 29, 2014 (gmt 0)

And back to our regularly scheduled discussion...

The primary challenge for any resource that seeks to index all or part of the web is ranking which is directly associated with the availability of on-screen real estate. You can only put so much information on a page before the page becomes unusable.

Directories address this with drill-down navigation primarily (and then paging at the leaf level) and SEs address the matter with various ranking schemes and results paging.

Ranking has always been a competitive game (even before the Internet). Take a standard phone book and look for businesses that begin their name with AAA for example. This is a simple example of how to work your way to the top of an alphabetical index.

So, any attempt to organize results will lead to favoritism of some sort, which then leads to attempts to compete for the top spots. Back in an earlier post, I suggested randomizing results (a bit tongue-in-cheek to be sure) and this, it would seem is one of the only ways to truly remove the concept of gaming the system. Oh, except that, like a raffle, the more raffle tickets you buy, the better your chance of winning the raffle.
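The raffle effect is easy to demonstrate. In this Python sketch (site names, page counts and function names are all made up), a site with nine pages competes with a site that has one page. Drawing a random page favours the bigger site about nine to one; drawing a random site first and then one of its pages evens things out, although it then invites people to register more sites instead of more pages.

```python
import random
from collections import Counter

def draw_by_page(pages, rng):
    """Pick one page uniformly: every extra page is an extra raffle ticket."""
    return rng.choice(pages)

def draw_by_site(pages, rng):
    """Pick a site uniformly first, then one of that site's pages."""
    by_site = {}
    for site, path in pages:
        by_site.setdefault(site, []).append((site, path))
    site = rng.choice(sorted(by_site))
    return rng.choice(by_site[site])

# Hypothetical matching pages for one query.
pages = [("big.example", f"/p{i}") for i in range(9)] + [("small.example", "/p0")]

rng = random.Random(0)
page_wins = Counter(draw_by_page(pages, rng)[0] for _ in range(10000))
site_wins = Counter(draw_by_site(pages, rng)[0] for _ in range(10000))
print("per-page draw:", dict(page_wins))
print("per-site draw:", dict(site_wins))
```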

So, ranking is the real challenge and some pretty good minds have tried to tackle that already with limited success. Any thoughts specifically on that topic? It's what we all love to hate about Google, YRMV.

EditorialGuy

5:51 pm on Jul 29, 2014 (gmt 0)

The first post has EG's moniker on it and that's what people searching for the history of Google's downfall and the rise of webmaster-driven, open source search will see 15 years from now when they find this thread.

I'll even contribute a slogan for you to use in marketing a "webmaster-driven, open source search" engine:

"1.17 billion Google searchers can't be right."

webcentric

5:54 pm on Jul 29, 2014 (gmt 0)

"1.17 billion Google searchers can't be right."

Right about what? They're all looking for something. Question is, how many are happy with what they find and how many would gladly use something better if it existed? So, maybe...

"1.7 billion ex-Google searchers now get better results with WDOS search."

trabis

6:24 pm on Jul 29, 2014 (gmt 0)

Hello World,

Could we avoid scraping, crawling and large indexes and go for a low budget approach?

Many websites provide a search box that is powered by the site itself (CMS). We could proxy the user queries to these websites, collect the answers and provide the results.

Any website that wanted to be listed in the results would have to implement a script on its server that would be used by the search engine. Think API.

1- A user enters a query on the search engine
2- The search engine searches for registered websites that might be able to deliver the answer. This is the hard part!
3- Deliver a formatted query to those websites, 20 websites should do.
4- Collect the responses, remove bad results, and display the best 10.

The search engine's responsibility would be to rank the websites according to user feedback, site response times and bouncing.
The websites' job would be to deliver the best results for the queries in the shortest time possible.
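The four steps above could be sketched roughly like this in Python. Everything here is an assumption for illustration: the `federated_search` name, the shape of a site record (`score`, `topics`, `search`), and the simple topic-overlap test used for step 2. In practice each `search` would be an HTTP call to the script on the site's server; plain functions stand in for that here.

```python
def federated_search(query, sites, fan_out=20, top_n=10):
    """Fan a query out to registered sites and merge their answers.

    sites: list of dicts with 'name', 'score' (site quality, 0..1),
    'topics' (set of words) and 'search' (callable returning (url, relevance) pairs).
    """
    words = set(query.lower().split())
    # Step 2: choose registered sites whose declared topics overlap the query.
    candidates = [s for s in sites if words & s["topics"]]
    candidates.sort(key=lambda s: s["score"], reverse=True)

    collected = []
    # Step 3: send the query to at most fan_out sites.
    for site in candidates[:fan_out]:
        try:
            answers = site["search"](query)
        except Exception:
            continue  # a failing site simply contributes nothing
        # Step 4: drop bad results and weight the rest by the site's score.
        for url, relevance in answers:
            if relevance > 0:
                collected.append((relevance * site["score"], url))

    collected.sort(reverse=True)
    return [url for _, url in collected[:top_n]]

# Hypothetical registered sites.
poems = {
    "name": "poems.example", "score": 0.9, "topics": {"love", "poem", "poems"},
    "search": lambda q: [("https://poems.example/best-love-poem", 1.0),
                         ("https://poems.example/junk", 0.0)],
}
cats = {
    "name": "cats.example", "score": 0.8, "topics": {"cat", "cats"},
    "search": lambda q: [("https://cats.example/claws", 1.0)],
}
print(federated_search("best love poem", [poems, cats]))
```

The site scores are where user feedback, response times and bounce data would feed in; this sketch just takes them as given.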

WHO GAINS?
Well, anyone could build this search engine. The important step is to have a common API to be used by the website owners. If someone comes up with a better ranking algorithm, a better caching service, a better search experience, I'll be good with it!

WHO PAYS?
Well, most of the cost is distributed among the website owners. If you have a good site, you will be queried more and you will be spending YOUR server resources. But cache will be your friend and since you will be trading API queries for real visitors, cash may be your friend too.

I always wondered how Google can possibly know what the best love poem on my site is. If Google were to ask me what page to deliver for a certain query, its users would get a better answer.

brotherhood of LAN

6:29 pm on Jul 29, 2014 (gmt 0)

jmc said: The future, I think, is a spectrum of search with a lot of niche search engines

I think so too. IMO (and after thinking about it a little while), there's room for a global outfit doing the fetching, parsing and normalising of documents. It'd maintain indexes of domains, citations, documents and words (and associated metadata like DNS info, document size, word hit types and what not), much like Google's first prototype.

The "fancy" part would be to classify the words in a theme taxonomy in order for niche providers to take a subset of the index in order to rank and serve results. The goal is to provide exactly or not much more of the indexes than the person needs.

The ranking part is up to the niche provider.
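A tiny Python sketch of that hand-off, assuming the central index is a word-to-documents map and the taxonomy is a theme-to-words map. The names `niche_subset`, `central_index` and `taxonomy`, and the sample data, are illustrative only.

```python
def niche_subset(index, taxonomy, theme):
    """Return only the part of the central index whose words belong to a theme.

    index: word -> set of document ids.
    taxonomy: theme -> set of words classified under that theme.
    """
    words = taxonomy.get(theme, set())
    return {w: set(index[w]) for w in words if w in index}

# Hypothetical central index and taxonomy.
central_index = {
    "genesis": {1, 2},
    "psalm": {3},
    "viagra": {9},
}
taxonomy = {"bible": {"genesis", "psalm", "exodus"}}

print(niche_subset(central_index, taxonomy, "bible"))
```

The niche provider gets a copy of just its slice and is then free to rank it however it likes.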

Half of it is already written [github.com], and there's already outfits [majesticseo.com] crawling a large subset of the web. The dots just need connected IMO.

Ah, one other thing worth mentioning (and already mentioned above), the centralised part also needs to provide a service to point users to relevant search engines. No point having 00000's of engines as no one would remember where to find them. Ranking those would be an issue :)

CaptainSalad2

6:37 pm on Jul 29, 2014 (gmt 0)

I'll even contribute a slogan for you to use in marketing a "webmaster-driven, open source search" engine:

"1.17 billion Google searchers can't be right."

Imagine Larry had the same attitude 15 years ago about AltaVista, where would you be now, over on the AltaVista forum ;)

Technology firms are always being replaced by newer, more innovative firms; it's just the way tech works.

Now Google has completely replaced innovating organic search with monetising organic search, it could be vulnerable to an open source engine and a webmaster grass roots movement that freely promotes it.

:)

EditorialGuy

7:48 pm on Jul 29, 2014 (gmt 0)

Technology firms are always being replaced by newer, more innovative firms; it's just the way tech works.

Aha! So Google isn't an immovable monopoly after all. :-)

FWIW, I'm not sure that Google Search falls under the heading of "tech firm." For the end user, its product isn't technology, it's search results. Technology is simply a means to an end.

webcentric

8:41 pm on Jul 29, 2014 (gmt 0)

The "fancy" part would be to classify the words in a theme taxonomy in order for niche providers to take a subset of the index in order to rank and serve results

This is exactly where my thoughts have been going all day...glossaries, categories, etc. laid over the larger system to break it into niches...glossaries could be used to create a set of common searches related to a given niche and some sort of categorization system can limit search scope to a particular topic. Thesaurus features would be helpful here for sure.

This also makes me think this is how Google could bust the whole thing e.g. by taking what it already has and breaking it apart into niche engines and serving up a front-end, developer API and other components. And of course it brings me back to where I think Google could and should separate commerce from information generally. That IMHO, would clean up Google dramatically.

EditorialGuy

8:55 pm on Jul 29, 2014 (gmt 0)

And of course it brings me back to where I think Google could and should separate commerce from information generally. That IMHO, would clean up Google dramatically.

I agree with that, but as long as we're talking about alternative search engines, here's some food for thought:

Niche search engines don't necessarily have to be built around topics. They can also be built around audiences. For example, you could have:

- A gay commerce search engine (gay-owned and gay-friendly vendors)

- A fair-trade commerce search engine (vendors of "fair trade"-certified products)

- A fundamentalist Christian commerce search engine (Evangelical Christian-owned vendors, and all products--even gemstones--guaranteed to be less than 4,000 years old).

- Search engines for Zionists, Islamists, Tea Party radicals, etc. who don't want their results tainted by sites that don't share their views.

Such search engines wouldn't need to be technically innovative or even "better than Google," since their unique selling proposition would be suitability for specific target audiences.

Martin Ice Web

9:18 pm on Jul 29, 2014 (gmt 0)

Alternative to google is qwant.com
I like the clear separation of serps and ads. For me it returns good results and I will for sure test it over the next few days.
The engine does not track user data or make personal result sets. LIKE!

jmccormac

9:30 pm on Jul 29, 2014 (gmt 0)

This also makes me think this is how Google could bust the whole thing e.g. by taking what it already has and breaking it apart into niche engines and serving up a front-end, developer API and other components.

It already has form on this. When Wikia Search announced it was going to have a Social Media element to its search engine (along with voting on results), Google tried the same thing. It also tried to copy Wikipedia with its "Knol" service where editors would be paid to edit. Naturally, those people in Google completely missed the reason that people create and edit Wikipedia pages. It is for the sheer joy of creating something of worth to others. The "Knol" service was quietly closed and people continued to use Wikipedia oblivious to the short existence of Google's poor attempt at a clone. Perhaps it was some kind of cultural clash between some non-creative, money driven people in Google, who apparently thought they could buy Wikipedia's editors, and the web's creative and more altruistic people who often edit Wikipedia.

And of course it brings me back to where I think Google could and should separate commerce from information generally. That IMHO, would clean up Google dramatically.

It might. But it faces the prospect of the commercial side becoming a Pay For Include with the non-commercial side being plastered with adverts and the Scraper Graph. Google may have a problem splitting the two from a commercial point of view.

One aspect that would be important is that Google's scrapers should be blocked from access to these (as yet hypothetical) search engines.

Regards...jmcc
