Link: nytimes.com/2004/09/30/technology/30search.html?ex=1254283200&en=d35197da5d8cccb9&ei=5088&partner=google
The New York Times is running an article about Clusty, a start-up search engine that "clusters" similar websites together and then uses those clusters to display relevant information to the searcher.
We all know Google uses keywords, but with the size of today's internet, keywords are not the best method.
"As databases get larger, trying to pull the proverbial needle out of the haystack gets tougher and tougher..."
Example:
I would imagine a good number of people search for very general terms. For our example: "Skydiving Overseas".
A Google search would look for that keyword phrase. However, this is detrimental for the searcher because it ignores sites in the same category that use a different phrase:
"Skydiving Abroad"
"International Skydiving"
"Skydiving in Mexico"
Experienced webmasters have developed an improvised way of covering all these keywords via SEO:
<Title> International Skydiving Abroad and Overseas in Mexico </Title>
But VERY few websites do this, I would guess fewer than 5 percent.
A clustering website would (in theory) be able to display all these sites perfectly, and I bet Google knows it.
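To make the skydiving example concrete, here is a minimal Python sketch of grouping page titles by the words they share. It is only an illustration under stated assumptions: the `cluster_titles` function, the Jaccard threshold of 0.2 and the sample titles are made up for this example, and nothing here claims to be how Clusty or Google actually cluster pages.

```python
def tokens(title):
    """Lower-case a title and split it into a set of words."""
    return set(title.lower().split())


def jaccard(a, b):
    """Share of words two sets have in common (0.0 to 1.0)."""
    return len(a & b) / len(a | b)


def cluster_titles(titles, threshold=0.2):
    """Greedy grouping: a title joins the first cluster that already
    contains a title it overlaps with, otherwise it starts a new one.
    The threshold is an arbitrary value chosen for this toy example."""
    clusters = []
    for title in titles:
        words = tokens(title)
        for cluster in clusters:
            if any(jaccard(words, tokens(other)) >= threshold for other in cluster):
                cluster.append(title)
                break
        else:
            clusters.append([title])
    return clusters


titles = [
    "Skydiving Overseas",
    "Skydiving Abroad",
    "International Skydiving",
    "Skydiving in Mexico",
    "Cheap Car Insurance",
]
clusters = cluster_titles(titles)
for group in clusters:
    print(group)
```

With these sample titles the four skydiving pages end up in one group even though none of them shares the exact phrase "Skydiving Overseas", which is the effect the example above is after. A real engine would need far more than word overlap on titles.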
We know that Google's frontend is at an all-time low in activity (not saying it isn't active, just less active), and we know that Google's backend is working on SOMETHING like crazy (some suggest a complete rebuild of the Google database).
Might I now propose a theory:
Facts:
1. Many internet search experts say "clustering" is a better way to organize today's massive internet.
2. Many new clustering search engines are coming out, but they currently do not have nearly the power of Google in terms of actually crawling the net for sites (this was admitted by Clusty in the article).
3. Google is about to face its biggest challenge yet from Microsoft.
4. Search engine users are fickle: whoever provides the better results wins (this is how Google won the first war).
5. Google's backend is very busy while Google's frontend is very static.
Theory:
Google is moving from a keyword-based system to a cluster-based system, hoping to beat the smaller companies by using its advantage in the "crawling gap" to provide a cluster-based system FAR larger than these new companies could ever hope to produce.
After years of trying to rework the algo to banish "crap-pages" and promote "great pages", each time being thwarted by SEOers (black hats mainly), Google has started on something brand new: new algo, new theory, and perhaps even a new delivery method (gbrowser?).
This new method would be able to provide better results (an example of this was provided above) and stop (at least temporarily) "crap-pages" by being less reliant on keywords.
Obviously I do not know specifically how they plan on doing this, but they have big brains... I'm sure they have a way.
This is pure speculation, but it is certainly a possibility.
[edited by: Livenomadic at 3:03 pm (utc) on Oct. 3, 2004]
But this is all completely crazy speculation. I'm just expecting something BIG from Google since we have seen so little activity, and clustering seems to be the next "BIG" thing in the search engine world.
I agree. I doubt that many searchers are dissatisfied with the current Google results. If it ain't seriously broke then no need for Google to "fix" it with a major change like this.
Yepp. The whole information science could be 20 years ahead if it had incorporated more from linguistics and language-philosophy than just Chomsky-Grammars on Processor-level.
One of my favourites is Nietzsche's talking about Gedanken-Dichtung. Of course you would translate Dichtung by poetry, which is said to be somewhat useless in big business, but it also has this connotation of "density", and you are not far away from a "cluster" then, are you?
sry if OT again but livenomadic how did you manage to make
multiple cLUSTers
pass the spam filters?
Oliver
> 4. Search engine users are fickle: whoever provides the better results wins (this is how Google won the first war).
They are not fickle. They are ignorant. If you had access to as many log files as some of us do, you would know that users make no sense when they search. I had one site that got lots of visits from people typing in "distance to [country]", where the country was whichever one they wanted to know the current distance to. Do you know how many people type in www.domain.com and then go "wow, G is so great, it found the site I was looking for"? The number of users who really care about how good an SE is and would switch would not even beat out Ask.com.
If you know anything about marketing, you know that all you have to do is be number one and stay there long enough to have mindshare. All you have to do to stay there is not take away what users are used to. Don't give me all the old Netscape examples; they are not relevant. The internet is at a place now where the average person uses it. It was not back then. G has hit a critical mass. MS should be number one, since it is the first thing anybody sees on their computer. People have to look at MSN Search and then type in www.google.com.
Perhaps it’s one of the next search mediums but G wouldn’t switch over its main algo to this method of indexing since March. It would have to be tested and tested and tested again (by the public) and when we figure out how it works we'd optimise for it, and then the crap will float back to the surface ; )
Revolution - Not quite, but I'm still battening down those hatches. I got me a woolly hat, tea bags and lots of tinned food.
Yeah, I can't imagine Google doing a major change like this without a public beta. Google is #1. They wouldn't want to make any major change without being sure the public liked it. If Google is thinking along these lines, then it has to be in the early stages.
I made a comment about "Clusty" in another thread [webmasterworld.com] (apparently it was a conversation-stopper!). I wondered afterwards if I was being too harsh, but overall I stick by what I said.
The only thing noteworthy about Clusty is that they have managed to get some decent media exposure (NYT is good stuff for a tech start-up these days), but it is in the context of a certain media hunger for news about search and Google being in their quiet period (now finished) - the media are going to take what they can get.
Great theory, Livenomadic!
Google shouldn't be complacent. However, they shouldn't leap into making major changes unless they have good data that the users will prefer the new, improved Google. Note Google got rid of the links in the SERPs to the Google directory. Given the poor quality of search terms many people use, I'd have thought these directory links were useful. With them, so long as one relevant site showed up in the SERPs, people could click on the link and find other relevant sites. Google must have done market research that showed people didn't want the directory links. Note that ask.com already has those "related links" similar to Clusty. I hadn't heard that people are flocking to ask.com away from Google in large numbers. People seem to like Google as it currently is: nice and simple.
But I immediately associated this with a GUI-based application. I don't have the source at hand, but recently I came across a site which offered a flash-based graphical display of the related:#*$!!-search query on Google (they kept the results of the link:xxx-search for themselves, I presume). Maybe one of the next generation of search engines will present its results as an image map with the old Netscape wheel at the bottom, showing the site names plus the way these sites are embedded into their cluster.
On the other hand, I must insist on heeding the hint towards semantics which glengara brought into the discussion. I mean, beginning with Aristotle, people have tried to bring order into knowledge, and even fought wars about it. Think of the endless discussions within DMOZ (at least its German branch) about how to structure various categories. Do you really think this might be done by an algorithm? It might be done on a statistical basis, it might be done by the number of backlinks like Google does now, but I don't yet see any way it might be done concerning "meaning." And if an engine claims to do so, I am looking forward to seeing its search results when querying for "meaning."
"taking a person from a dark room into a bright one"
"grey water treatment at the cottage"
"popcorn kernel stuck in teeth"
Seriously, those queries won't find anything related to the person's query. At least no good solution to them. I would have just searched for something more like this:
"transition from dark to light room"
"grey water treatment"
-- I wouldn't even bother with this one. Call the dentist instead.
Query: pharmacy
---
chemist
chemist's
pharmacist's
prescriptions
rx
drug
pharmacies
pharmacy
drugs
drugstore
pharmacist
pharmaceutical
It's all over their index. It's just not obvious. I think it's related to the sandbox, in fact.
But it's really, really good.
Clusty/Vivisimo is a really, really good engine. It's just implemented very poorly. G's implementation makes it transparent to the user.
That's the magic ticket.
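The idea in the pharmacy list above, matching a query against related words rather than the literal keyword, can be sketched in a few lines of Python. This is a toy under loud assumptions: the `RELATED` table is hand-written from a few of the words listed above, and `expand_query` and `matches` are hypothetical helpers, not anything Google or Vivisimo has published.

```python
# Hand-made table of related terms (an assumption for illustration;
# a real engine would have to derive this from its index).
RELATED = {
    "pharmacy": ["chemist", "drugstore", "pharmacist", "prescriptions", "rx"],
}


def expand_query(query, related=RELATED):
    """Return the query words plus any related terms from the table."""
    terms = []
    for word in query.lower().split():
        terms.append(word)
        terms.extend(related.get(word, []))
    return terms


def matches(document, query):
    """True if the document contains any word of the expanded query."""
    doc_words = set(document.lower().split())
    return any(term in doc_words for term in expand_query(query))


print(expand_query("pharmacy"))
print(matches("Your local drugstore is open late", "pharmacy"))
print(matches("Skydiving in Mexico", "pharmacy"))
```

The searcher only ever types "pharmacy"; the expansion happens behind the scenes, which is the "transparent to the user" point made above.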
Is the search engine there to serve my needs, or has it meanwhile become the other way round? I think you are overgeneralizing your own dependency on SEs as a webmaster if you argue like that.
> But it's really, really good.
From a linguistic point of view SEs still concentrate far too much on noun-phrases, but the key-clues to the efficiency of language and thought are what is generally subsumed under "particles". To take the above example:
> "taking a person from a dark room into a bright one"
Though I admit I have no idea what the author was searching for, the query gives some interesting hints as to what SEs do NOT cover yet:
1) Verbs and any kind of action (perhaps he was searching for self-illuminating door handles; the concept of opening a door immediately comes to the mind of a human reader, while I see no way a search on related NPs would lead from "room" to "door" at the present state of technology. You'd rather receive hundreds of ads on "real estate" instead.)
2) The fuzziness of adjectives (bright, brighter, the brightest, and what about "dull" users of search engines?)
3) Beyond the basic logical concepts of "and," "or" and the like, I see no real progress since the basic works of Frege and Russell/Whitehead 100 years ago. Even a minimum of the concepts of set theory has not yet been implemented, though it should be clear that a thorough implementation of prepositions would be very, very helpful for the users, e.g. "Who was king of England immediately before Henry VIII?"
I tried to quote Google's answer telling me that "who," "was" and "of" are ignored in the search, but I have no access to the English version of "all about google" (*g*).
Maybe we should not expect too much yet, and maybe the idea of semantic clusters incorporated by a money-making search engine is a first and important step towards a solution of the world-knowledge-barrier, which has so long prevented the promises of artificial intelligence from becoming real.
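A small Python sketch of what ignoring "who," "was" and "of" does to the Henry VIII question. The stop list is an assumption containing only the three words mentioned above; the real list used by any engine is not known here.

```python
# Assumed stop list: only the three words mentioned in the post.
STOP_WORDS = {"who", "was", "of"}


def strip_stop_words(query):
    """Lower-case the query, drop the question mark and remove stop words."""
    words = query.lower().replace("?", "").split()
    return [w for w in words if w not in STOP_WORDS]


kept = strip_stop_words("Who was king of England immediately before Henry VIII?")
print(kept)  # ['king', 'england', 'immediately', 'before', 'henry', 'viii']
```

What is left is a bag of keywords. "before" survives as a word to match, but nothing in a plain keyword match treats it as a relation between two kings, which is the gap described in point 3.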
This is why Google is working on the largest Bayesian database of clusters to determine the most likely meaning for any given search request.
A few other interesting things in there as well.
<cit>
Discard articles, prepositions, and conjunctions
Discard common verbs (know, see, do, be)
Discard pronouns
Discard common adjectives (big, late, high)
Discard frilly words (therefore, thus, however, albeit, etc.)
Discard any words that appear in every document
Discard any words that appear in only one document
<end cit>
Nevertheless, I'd agree there is something on the way. I am currently working on a little script to run such a vector-space-based analysis of two websites in order to find out whether the results might explain the differences in PageRank of two pages with identical inbound-link structure. However, I lack a good list of such "stop words" in English and/or German. So far I use the workaround "delete any word containing fewer than five letters", but this is not very satisfying. Any ideas?
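For what it's worth, here is a minimal Python sketch of that kind of script. It applies two of the quoted rules (discard words that appear in every document, discard words that appear in only one) plus a tiny stop list, then compares pages by cosine similarity of their word counts. Everything specific is an assumption for illustration: the `STOP_WORDS` set is a placeholder and not a real English or German list, the sample texts are invented, and the function names are hypothetical.

```python
import math
import re
from collections import Counter

# Placeholder stop list (assumption); swap in a proper English/German list.
STOP_WORDS = {"a", "an", "the", "and", "or", "in", "of", "is", "was", "who"}


def tokenize(text):
    """Lower-case the text and extract runs of letters (incl. German umlauts)."""
    return re.findall(r"[a-zäöüß]+", text.lower())


def build_vocabulary(docs):
    """Keep words that are not stop words and that appear in more than
    one document but not in every document."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(tokenize(doc)))
    n = len(docs)
    return {w for w, c in doc_freq.items() if w not in STOP_WORDS and 1 < c < n}


def vector(doc, vocab):
    """Term-count vector of a document, restricted to the vocabulary."""
    return Counter(w for w in tokenize(doc) if w in vocab)


def cosine(a, b):
    """Cosine similarity of two term-count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


docs = [
    "Skydiving abroad is a thrill. Book skydiving trips in Mexico.",
    "International skydiving trips and skydiving schools in Mexico.",
    "The pharmacy fills prescriptions. The pharmacy is open late.",
    "A pharmacy and drugstore guide: prescriptions and the law.",
]
vocab = build_vocabulary(docs)
vectors = [vector(d, vocab) for d in docs]
print(sorted(vocab))
print(cosine(vectors[0], vectors[1]))  # the two skydiving pages
print(cosine(vectors[0], vectors[2]))  # skydiving page vs pharmacy page
```

Note that the document-frequency rules only make sense with more than two documents: with exactly two pages, every word is either in one page or in both, so the vocabulary would be empty. Comparing two sites therefore needs a background collection of other pages to compute the frequencies against.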