Forum Moderators: open
"Any clue as to the possible role greater reliance on semantics is playing in your never ending quest for more relevant results?"
I'd say that's inevitable over time. The goal of a good search engine should be both to understand what a document is really about, and to understand (from a very short query) what a user really wants. And then match those things as well as possible. :) Better semantic understanding helps with both those prerequisites and makes the matching easier.
So a good example is stemming. Stemming is basically SEO-neutral, because spammers can create doorway pages with word variants almost as easily as they can optimize for a single phrase (maybe it's a bit harder to fake realistic doorways now, come to think of it). But webmasters who never think about search engines don't bother to include word variants--they just write whatever natural text they would normally write. Stemming allows us to pull in more good documents that are near-matches. The example I like is [cert advisory]. We can give more weight to www.cert.org/advisories/ because the page has both "advisory" and "advisories" on the page, and "advisories" in the url. Standard stemming isn't necessarily a win for quality, so we took a while and found a way to do it better.
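To make the stemming idea concrete, here's a toy sketch (my own illustration, not Google's actual stemmer - real stemmers like Porter's are far more careful). A crude suffix rule maps word variants to a shared stem, so a query for "advisory" can also credit a page containing "advisories":

```python
# Toy stemming sketch: map word variants to a shared stem so that
# near-match documents score against the query. Illustration only.

def crude_stem(word):
    """Very rough English suffix stripping, for demonstration only."""
    word = word.lower()
    if word.endswith("ies"):
        return word[:-3] + "y"   # "advisories" -> "advisory"
    if word.endswith("es"):
        return word[:-2]
    if word.endswith("s"):
        return word[:-1]
    return word

def match_score(query, page_words):
    """Count page words whose stem matches any query-term stem."""
    query_stems = {crude_stem(w) for w in query.split()}
    return sum(1 for w in page_words if crude_stem(w) in query_stems)

# Both the singular and plural forms count toward the query.
page = ["cert", "advisory", "advisories", "archive"]
print(match_score("cert advisory", page))  # -> 3
```

Without stemming, only the exact tokens "cert" and "advisory" would match; with it, "advisories" contributes too, which is the [cert advisory] example above in miniature.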
So yes, I think semantics and document/query understanding will be more important in the future. pavlin, I hope that partly answers the second of the two questions that you posted way up near the start of this thread. If not, please ask it again in case I didn't understand it correctly the first time. :)
http://www.google.com/contact/spamreport.html
or write webmaster(at)google.com
As long as 64. is staying stable I'm OK to wait, the others are just too surrealistic.
For LSI, this is what I've got bookmarked
[javelina.cet.middlebury.edu...]
[edited by: Marcia at 8:40 am (utc) on Feb. 17, 2004]
Do you realise how annoying it is to get notifications in and check the results just to see more insignificant nonsense from you numbers game punters saying I got this and I got that?
Be considerate to those of us who don't give a toss what you are seeing in beautiful downtown Burbank. I'm getting angry ;-{
Um, try to keep up with the program. Nobody is talking about that.
It seems the shakeup has settled down now, temporarily at least. The only lasting effect I'm seeing is that a lot of fresh piddle was introduced, and the results have degraded somewhat.
Maybe it was just introducing fresh pages before moving 64 over, but I sure hope they don't do that again anytime soon. That was genuinely scary.
Some of us must have commented too early and those comments got deleted. This was not normal fluctuation and was not anything like 64, 216, www. or anything else ever seen. It was as if most of the algo and all of the filters had been turned off.
Single IP addresses were fluctuating wildly, giving different results every time you hit the refresh button.
Beedee
you were told "not" to have email notifications on this thread as it would be a large one; it's your choice to look at this thread
You can also find the CIRCA semantics paper salted away if you know where to look ;)
A brief and very simple summary of what I think (IMHO) this has to do with this forthcoming update (which can't come soon enough for me). Think of the analogy of fingerprint analysis. The analyser only looks at certain types of feature - whorls, intersections, branches etc. - and marks their location. The analyser ignores all of the straight uninteresting lines that every fingerprint has on it. Latent semantic indexing does the same with words: it ignores all of the straightforward words and concentrates on the words that have real meaning. The CIRCA Ontology defines the closeness of match of these words and creates a single statistical vector for each page. The Google algo uses this as a contributor to the SERPs.
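For anyone who wants to see the mechanics, here's a minimal LSI sketch (my own illustration - no claim that this matches what Google or CIRCA actually run). It builds a term-document matrix, drops the "uninteresting straight lines" (stopwords), then uses a truncated SVD to get one low-dimensional vector per document:

```python
# Minimal latent semantic indexing sketch (illustrative only).
# Stopwords are discarded, then a truncated SVD projects each
# document to a single low-dimensional "semantic" vector.
import numpy as np

STOPWORDS = {"the", "a", "of", "and", "to", "is"}

docs = [
    "the care and feeding of ducks",
    "ducks and geese of the pond",
    "the history of steam engines",
]

vocab = sorted({w for d in docs for w in d.split() if w not in STOPWORDS})
A = np.array([[d.split().count(w) for d in docs] for w in vocab], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                        # keep the strongest dimensions
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T    # one k-dim vector per document

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# The two duck documents land closer together in the reduced space
# than either does to the steam-engine document.
print(cos(doc_vectors[0], doc_vectors[1]) > cos(doc_vectors[0], doc_vectors[2]))
```

The point of the truncation is exactly the fingerprint analogy: only the strongest patterns of word co-occurrence survive into the vector, and everything else is discarded.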
The signs are that this overwhelmed the "old" part of the algo in Florida and, to a greater extent, in Austin. Now in my opinion either they have, through a process of trial and error, removed or added back an extra feature in the semantic analysis, or they have up-weighted the part of the old algo designed to bring back the micro-relevant sites. Whichever way they have done it, it has worked pretty well in some areas.
I'm becoming convinced that the same technology is spotting dupes. If two pages have the same vector, they are the same. Since latent semantic indexing aims to throw out things that don't help it to compare a group of documents, I guess that the first thing it would throw out is duplicates. Too bad for folks on servers that serve up the same pages on www and non-www versions of their domains. I think that this explains the unexplained complete drop from SERPs of previously high-ranking pages since the Florida update, and possibly before.
The Brandy update adds in or takes out a minor ingredient but LSI/CIRCA is a big part of the recipe.
Best wishes
Sid
<Too bad for folks on servers that serve up the same pages on www and non-www versions of their domains.>
Unfortunately I was one of these sites that lost all ranking, but I have just installed a 301 redirect and hopefully this will get me out of jail. Has anyone who suffered a similar fate as a result of Austin/Brandy recovered yet? If so, was it done through a 301, and how long did it take?
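For anyone else in the same boat, one common way to do the non-www to www 301 looks like this (hedged example - this assumes an Apache server with mod_rewrite enabled, and "example.com" stands in for your own domain):

```apache
# Redirect non-www requests to the www version with a permanent 301.
# Assumes Apache with mod_rewrite; swap in your own domain.
RewriteEngine On
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
```

Other server setups need a different mechanism, but the goal is the same: only one canonical version of each page answers with a 200.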
Are there just 2 of them: 233.161.104 and 233.161.99?
I'm about as sure as I am that the World is not flat and NASA filmed the Lunar landings in the Nevada desert ;)
Oh, and Googleguy confirmed that, to paraphrase, "they have found a better way of doing semantic indexing". If it walks like a duck, quacks like a duck, and the best ornithologist you know says it's a duck, I think it's safe to assume that it's a duck. Now we know it's a duck, we can assume that it likes splashing about in ponds, the rain, quacking outrageously at duck jokes, etc.
If they are not using LSI to spot dupes what technology do you think they are using?
Best wishes
Sid
edit reason: This CGI is screwing up my posts again
To spot a dupe, both pages would have to show up in the exact same vector position. A single additional token word recognised by the semantic indexing would move the dupe site to a different position in the vector space. Also, with pages containing a very small number of token words, it's not inconceivable that two totally different pages might occupy the same position in the vector space. Just my 2 cents worth, but I'm not sure LSI could easily be used for dupe content spotting.
It also seems to me in some cases that the surviving page gets a boost in the rankings from eliminated pages with duplicate content that link to it. I would guess this would only be the case if the pages are not seen as affiliated.
Does this fit with anything anyone else is seeing? - Sid?
My results on 64 are great - but on google.ca they are even better still!
I'm thinking that maybe Canada has the 64 results, but with the benefit of backlinks added or something - it's been like that consistently for the last couple of days.
Google has put too much weight on page linking. Just because so many webmasters have purchased links on high-PR sites to push their rankings higher does not mean they have a high-quality site. A good-quality site should have nothing to do with who links to you.
*Sigh* The Same Old Delusion returns. Sometimes, I hate new users.
So you think it's unfair to use a system that takes into account multiple opinions about your site, and that it would be more fair to switch to a system that only uses one opinion of your site? Because that's what you get if you throw away citation analysis: An engine from the bad old days, when everything depended on The Secret Algorithm, and we had absolutely no chance of recognizing or resisting arbitrary filters. Anonymous programmers decided what was important to everyone.
It's truly frightening how many webmasters cry out for a return to search engine dictatorship whenever democracy fails to give them what they want.
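For anyone unfamiliar with how citation analysis actually works mechanically, here's a toy PageRank-style power iteration (an illustrative sketch only, not Google's production algorithm - the real system has many more signals and refinements):

```python
# Toy PageRank-style citation analysis by power iteration.
# Each page's rank is built from the ranks of pages linking to it -
# the "multiple opinions" idea in miniature. Illustration only.

def pagerank(links, damping=0.85, iters=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - damping) / n for p in pages}
        for p, outs in links.items():
            if outs:
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new[q] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank

graph = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
# "c" is cited by both "b" and "d", so it ends up ranked highest.
print(max(ranks, key=ranks.get))  # -> c
```

The contrast with the "Secret Algorithm" days is that the inputs here are other sites' links - opinions anyone can observe - rather than weights chosen entirely behind closed doors.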
It's not rolled in yet at least not entirely
Does this fit with anything anyone else is seeing? - Sid?
Not sure.
Just to clarify something. Naive Bayes = simple page semantic analysis. Things like spam filters on email progs.
Latent semantic indexing = much more accurate.
CIRCA = several orders more accurate than LSI because of its huge Ontology
CIRCA + Google = killer solution. Add what Google knows about pages to what CIRCA senses about pages and linked pages and you should have a very accurate system for SERPs and spotting dupes.
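To make the Naive Bayes end of that spectrum concrete, here's a toy classifier in the spirit of email spam filters (my own illustration - real filters add better tokenization, priors, and training at a much larger scale):

```python
# Toy Naive Bayes text classifier, in the spirit of email spam filters.
# Each word contributes an independent log-probability vote. Illustration only.
from collections import Counter
from math import log

spam = ["buy cheap pills now", "cheap pills cheap deals"]
ham = ["meeting notes for tuesday", "notes on the pills study"]

def train(docs):
    counts = Counter(w for d in docs for w in d.split())
    return counts, sum(counts.values())

spam_counts, spam_total = train(spam)
ham_counts, ham_total = train(ham)
vocab = set(spam_counts) | set(ham_counts)

def log_prob(word, counts, total):
    # Laplace smoothing so an unseen word doesn't zero out the score.
    return log((counts[word] + 1) / (total + len(vocab)))

def classify(text):
    words = text.split()
    spam_score = sum(log_prob(w, spam_counts, spam_total) for w in words)
    ham_score = sum(log_prob(w, ham_counts, ham_total) for w in words)
    return "spam" if spam_score > ham_score else "ham"

print(classify("cheap pills"))  # -> spam
```

Note how crude this is compared with LSI: every word votes independently, with no notion of which words carry real meaning or how they relate - which is exactly the gap the more semantic approaches are meant to close.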
Re dupes: it's not just one measure that's used. In fact it could be a cascade: if 95%-plus certain of a dupe, then cross-reference other algo components.
LSI is like an evolutionary step on the way towards what Google is (in part) implementing now as PART of its algo. If you understand something about LSI then you start to understand what is going on in SERPs.
Many of the papers on LSI and similar analysis methods talk about the use of training sets of data to teach the algorithm right from wrong. I wonder if this is what we are seeing now, i.e. Google/CIRCA gets to fourth grade. If that is the case, then this is the first of a much-improved implementation of the new technology, and it could get better with each update. What a shame "better" is such a subjective word, and "one man's meat is another man's poison".
Best wishes
Sid
It's truly frightening how many webmasters cry out for a return to search engine dictatorship whenever democracy fails to give them what they want.
In this democracy, those that got the vote (PR) in the last election, get to choose who wins in the next election - that's quite often how dictatorship starts.