Search engine giant Google recently acquired an advanced text search algorithm invented by Ori Alon, an Israeli student... Orion, as it is called, which Alon developed with faculty, relates only to the most relevant textual results..."For example, if you search information on the War of Independence, you'll receive a list of related words, like Etzel, Palmach, Ben-Gurion," he explained. The text will only appear on the results page if enough words relevant to the search and the link between them is reasonable.
<quote reduced to 4 sentences
See Terms of Service [webmasterworld.com]>
[edited by: tedster at 5:56 pm (utc) on April 9, 2006]
Their algorithm must change to reflect the information in the manner that humans are requesting it .. not in terms of the webmaster author who wants to trick the SE to get their site on the first result page.
Very good point, Christopher.
Bank tellers used to be taught that the best way to detect counterfeit money was to know the real stuff so well that the fakes stood out.
If Google improves how they determine relevance, then the flip side is a reduction in spam.
Australian Copyright Law is very different from US.
First Published implies ownership and this is protected
by Legislation; if you published something second, you
have to prove you didn't copy the original. The process
is automatic and requires no registration.
Similarly, Australian Patent Law is different from US
so those who are busy looking for a Patent or Copyright
may not find one.
The Australian Legislative process is recognised in
International Law so an Australian Patent or
Copyright is automatically recognised in the US
without need for a second application or registration.
"Clustering" has been around for a while, and I know of at least one other project in Sydney that was publicised on the BBC last year using clustering techniques.
Orion may be a great advance, but I wouldn't hold your breath about it being the bastion of sustainable structural discipline. Three years or so ago the "HillTop Algorithm" patent was the talk ... not sure if it was used or superseded.
Unless you have everyone forced to define their sites & data to a convention there will be continual ongoing upgrade work required.
Allon offers a search on the "American Revolution" as an example of how the system works. Orion would bring up results with extracts containing this phrase, but it would also give results for American History, George Washington, American Revolutionary War, Declaration of Independence, Boston Tea Party and more.
Broadmining not only gives associated phrases, but also clusters them.
Search: "American Revolution"
Continental Congressmen: George Washington, John Joachim Zubly, John Adams, Thomas Jefferson, Alexander Hamilton, Benjamin Franklin, more...
U.S. colonial history: Colonial government in America, Boston Tea Party, Quebec Act, Pennamite Wars, Writ of Assistance, Colonial America, more...
American Revolution: Olive Branch Petition, Founding Fathers of the United States, Boston Massacre, Committee of correspondence, more...
RELATED: History of the United States Constitution, Nobles' Democracy
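For anyone who wants to play with the idea, here is a minimal sketch of how "related phrases" could be pulled out by simple co-occurrence counting. It is only an illustration: the toy documents, the phrase list and the function name `related_phrases` are all made up, and nothing here is claimed to be how Orion or Broadmining actually works.

```python
from collections import Counter

# Toy documents, invented purely for illustration.
DOCS = [
    "american revolution george washington continental congress",
    "american revolution boston tea party colonial america",
    "boston tea party colonial america quebec act",
    "george washington john adams continental congress",
    "blue wyandot chicken eggs",
]

# Hypothetical list of phrases the system already knows about.
PHRASES = [
    "american revolution", "george washington", "continental congress",
    "boston tea party", "colonial america", "quebec act", "john adams",
    "chicken eggs",
]

def related_phrases(query, docs=DOCS, phrases=PHRASES):
    """Count how often each known phrase shares a document with the query."""
    counts = Counter()
    for doc in docs:
        if query in doc:  # plain substring match, good enough for a sketch
            for phrase in phrases:
                if phrase != query and phrase in doc:
                    counts[phrase] += 1
    return counts.most_common()

if __name__ == "__main__":
    for phrase, hits in related_phrases("american revolution"):
        print(phrase, hits)
```

A real system would need proper tokenising and weighting, but even this crude count keeps "chicken eggs" away from an "american revolution" search because the two never appear in the same document.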
Whether this is used in its entirety or is absorbed into the grand scheme of things, it's certainly turned into great marketing. How many column inches have been devoted to this?
Isn't that what this kind of thing is about? Getting people to believe Google is cutting edge, instead of Google demonstrating it?
It's all hype
I'd say the 4 month long updates and the general messing around with the index fall into the same category. They may dazzle the rest of the world with their hi-tech, uber cool approach to technology, bamboozle investors with their forays into competitors' territory (IMs, chat, possibly online word processing etc), but that doesn't really fool the people who are analysing their search results in detail. So perhaps this is the little slice of marketing aimed at us, since we won't be as easily fooled by stuff we think they have no chance of succeeding in.
I am of the view Google have to be judged solely on the basis of how they perform right now, and not on the basis of what they may or may not do in future. And my conclusion is that they are an ok search engine, better than most. But it won't last.
Although I welcome the thought of them exploring more sophisticated ways of conducting search, I'm not really sure I care. It's their job to stay ahead of the game. In any event, they've never once addressed one of the major issues associated with spam, namely the conflict of interest caused by AdSense. They make a ton of money from total crap.
The average user won't even hear about it.
This is true. Most people don't care, and a lot don't even realise there's advertising on Google (for example). In other words the detail is strictly for professionals, and other search engines.
Obscure no-name researchers in Australia aren't going to change a thing in terms of perception.
Until the kind of rumour talked about in this thread is anything like fact I don't care.
I have no doubt Google already have the know-how; it is probably already built into their existing ranking algo.
Besides, I remember reading some time ago about proximity search being included in an algo (I don't remember whether it was Google, MSN or Yahoo), and I think Google has already included it in their ranking algo.
Google has used it for AdWords for a while ...
Take a look at the keyword tool - you can find related keywords, but then look for alternative keywords too. It is true that the alternative keywords search returns more or less logical results; anyway, the site-related keywords tool works much better.
Take a look and then use the site-related keywords results in your link campaign :)
I apologize if this question comes as a strange one, but after reading every link to the story and everybody's postings I still don't quite grasp what Ori Allon's algorithm does. I know it is not an entire SE per se, so it doesn't crawl; I take it Ori must have been working on Google's results (guess that now he has full access to the databases, huh?).
Does it trim spam and unrelated content out of a search, or does it search only for related content (I hope I'm clear enough on that, I think they're different things)?
Added to my confusion is the fact that it doesn't surprise me that much that the early Orion could show relevant excerpts indicating potentially useful comment, since for instance I can do that myself just by looking at Google's results as can everybody else right? (i.e. to know whether a result is a spoof or "real" content just by looking at the snippet without actually browsing into it).
I don't know what you know about information management and information hierarchy, but basically it follows a form or standard similar to a library cataloging system.
Back in the 'old days' we had a Dewey decimal system which rated books in terms of subject-fields, sub-fields, categories .. and so on.
This system does a similar sort of thing .. apart from extracting the search term in the context of the site, it also puts it in terms of all of the categories, fields and subfields that the site as a whole covers.
The word "demograph" could also be used here so that, having analysed the context of all of the results for a particular search term, it looks at how the term is used.
For eggsample: if I wanted to research "the percentage of fertile Blue Wyandot Eggs and the rate of deformity of the hatchlings in Lower Bumplesteen in Craddock", this algorithm would allow me to find this in context because it will have indexed all of these terms and related them.
Perhaps you could use the term "Relational Database" as well as "Clustering" as well as "context" ...
So, Mr and Mrs Spam who set up a site that has an automatic page generator that picks up any of the terms in my search string is going to rank much lower; for example, if the spam page was about "Blue Wyandot Eggs", it would also have to be related to everything about a "Blue Wyandot", which is a breed of farm chicken (or poultry) ... etc .. and the more "relationships" the term has the better.
thus a page that is SEO'd for Wyandots is not going to rank well unless it has references within the SITE to what a Wyandot actually is and does .. to provide the context.
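To make that concrete, here is a rough sketch of scoring a site by how much context it gives for a term. Everything in it is an assumption for illustration: the function name `context_score`, the list of related terms and the two toy sites are invented, and it is not a description of what Orion or Google actually computes.

```python
def context_score(site_pages, target, related_terms):
    """Return the fraction of related terms found anywhere on the site.

    The score is 0.0 if the target term itself is missing or if no
    related terms were supplied.
    """
    text = " ".join(site_pages).lower()
    if not related_terms or target.lower() not in text:
        return 0.0
    found = [term for term in related_terms if term.lower() in text]
    return len(found) / len(related_terms)

# Hypothetical terms a cataloguer would expect around "Blue Wyandot".
RELATED = ["chicken", "poultry", "breed", "hatchlings"]

# A generated page that only repeats the search phrase.
spam_site = ["Blue Wyandot Eggs blue wyandot eggs buy blue wyandot eggs"]

# A site whose pages actually explain what a Wyandot is.
real_site = [
    "Blue Wyandot eggs and fertility",
    "The Wyandot is a breed of farm chicken",
    "Raising hatchlings: poultry care",
]

if __name__ == "__main__":
    print(context_score(spam_site, "Blue Wyandot", RELATED))
    print(context_score(real_site, "Blue Wyandot", RELATED))
```

The keyword-stuffed page mentions the phrase plenty of times but scores nothing for context, while the site that explains the breed across several pages scores high. Note that the score is taken over the whole site, not the single page, which is the point being made above.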
Read up a bit about cataloging and Digital Information Management Systems .. it's actually a highly specialised field.
In other words, due to some pressure that has been applied to them, Google are moving beyond "web page indexing" to "contextual digital archive retrieval" and that's a whole different ball game, logic, and way of thinking.
If you can access "www.dialog.com", you might get an idea of a very grass roots "intelligent database"
Is this helpful for you?
Secondly, your explanation is clear to me, yet I feel like pointing out: is it specifically this ability to search only "true" content and discard SEO spam that makes Orion so unique?
Maybe this will sound too bold from me, but it doesn't surprise me much. I apologize, for I'm not an Information Specialist. I only know programming and I am good at regular expressions and data mining. In a nutshell, part of my job is to write web crawlers that gather specific data. Perhaps it is the fact that I'm only taking into consideration the "nuts & bolts" that makes my vision of Orion too narrow, but I don't think it is such a breakthrough, since it is my opinion that anyone with intermediate skills can develop a tool to compare content and weed out plain spam.
I understand what you are saying and asking - and from the perspective you are asking it.
The difference is that this "intelligent database" and "query engine" was designed to catalog information; the elimination of spam happens because of the way it does this.
It was not designed with any purpose of eliminating spam - spam was eliminated "by default" and there is no need for "spam filtering" - unless the spam can "intelligently" be generated by a similar program.
Do you understand the "inversion" in the logic here?
Qualify and quantify the information "dynamically" and the index can't be spammed; this happens because the result page is generated from relational databases and not stored as an index or index/content;
in other words, the conventional idea of "cache" goes out the window.
think of a massive MySQL database, relationship queries and dynamically generated nested templates which assign the queries according to how the database feeds back into the template generator
nothing is eliminated, nothing is rejected, nothing is blocked.
the only way to get a spam page to the top of the results would be to shift the weighted average by publishing enough sites to shift that average in your favor .. and, even then, you would only have influence in one or maybe two contextual relationships.
However, if you did this, duplicate pages would show up in the results as one "context" repeated in a number of sites. This then appears on a supplementary page, not on the first page of results.
Another by-product of this is that "date of publishing" will give some relevance to copyright but originality will not be penalised.
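If it helps, here is a very small sketch of the "relationship query" idea, showing how a repeated context collapses into one row instead of filling the results. It uses Python's built-in sqlite3 module rather than MySQL so it runs on its own, and the table name `page_context`, the example sites and the contexts are all invented for illustration; it is not Orion's actual schema.

```python
import sqlite3

def build_demo_db():
    """Create an in-memory database with a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE page_context (site TEXT, term TEXT, context TEXT)"
    )
    rows = [
        ("farm-a.example", "wyandot", "poultry breed"),
        ("farm-b.example", "wyandot", "egg fertility"),
        ("spam-1.example", "wyandot", "buy eggs now"),
        ("spam-2.example", "wyandot", "buy eggs now"),
        ("spam-3.example", "wyandot", "buy eggs now"),
    ]
    conn.executemany("INSERT INTO page_context VALUES (?, ?, ?)", rows)
    return conn

def contexts_for(conn, term):
    """Return (context, site_count) pairs, one row per distinct context."""
    cur = conn.execute(
        "SELECT context, COUNT(DISTINCT site) FROM page_context "
        "WHERE term = ? GROUP BY context ORDER BY context",
        (term,),
    )
    return cur.fetchall()

if __name__ == "__main__":
    for context, sites in contexts_for(build_demo_db(), "wyandot"):
        print(context, sites)
```

Publishing the same page on three sites does not buy three results here: the GROUP BY folds them into a single context with a site count of 3, and nothing had to be filtered or blocked to get there.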
So basically, with the help of Orion, your search results will now be "knowledge", instead of "all bytes that matched the query".
Just out of pure curiosity. Suppose there was this program you can give keywords like "blue disc".
Then the program makes a search and returns:
"Blue Ray DVD technology"
"Blues Music CDs"
"Colored CD-Rs for sale"
As a disambiguation, from which you can browse further.
Should a program like that appear all of a sudden on eBay, how much do you people think it would sell for?