"Any clue as to the possible role greater reliance on semantics is playing in your never ending quest for more relevant results?"
I'd say that's inevitable over time. The goal of a good search engine should be both to understand what a document is really about, and to understand (from a very short query) what a user really wants. And then match those things as well as possible. :) Better semantic understanding helps with both those prerequisites and makes the matching easier.
So a good example is stemming. Stemming is basically SEO-neutral, because spammers can create doorway pages with word variants almost as easily as they can optimize for a single phrase (maybe it's a bit harder to fake realistic doorways now, come to think of it). But webmasters who never think about search engines don't bother to include word variants--they just write whatever natural text they would normally write. Stemming allows us to pull in more good documents that are near-matches. The example I like is [cert advisory]. We can give more weight to www.cert.org/advisories/ because the page has both "advisory" and "advisories" on the page, and "advisories" in the URL. Standard stemming isn't necessarily a win for quality, so we took a while and found a way to do it better.
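A minimal sketch of the idea in that paragraph, and not Google's actual method: the toy stemmer below maps "advisory" and "advisories" to the same stem, so a page that only uses one variant can still match the query [cert advisory]. The function names (`toy_stem`, `stems`, `matches`), the two hand-written suffix rules, and the sample page text are all assumptions made up for illustration.

```python
import re

def toy_stem(word):
    """Very crude stemmer (illustrative only): folds simple plurals."""
    w = word.lower()
    if len(w) > 4 and w.endswith("ies"):
        return w[:-3] + "y"      # advisories -> advisory
    if len(w) > 3 and w.endswith("s") and not w.endswith("ss"):
        return w[:-1]            # notes -> note
    return w

def stems(text):
    """Set of stems for every alphabetic token in the text."""
    return {toy_stem(t) for t in re.findall(r"[a-z]+", text.lower())}

def matches(query, page_text):
    """True if every query stem also occurs among the page's stems."""
    return stems(query) <= stems(page_text)

# Hypothetical page text; it never uses the singular "advisory".
page = "CERT advisories and incident notes"
print(matches("cert advisory", page))
```

Without the stemming step, the singular query word would not appear anywhere in this page's text, which is the near-match case the post describes.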
So yes, I think semantics and document/query understanding will be more important in the future. pavlin, I hope that partly answers the second of the two questions that you posted way up near the start of this thread. If not, please ask it again in case I didn't understand it correctly the first time. :)
http://www.google.com/contact/spamreport.html
or write webmaster(at)google.com
As long as 64. is staying stable I'm OK to wait, the others are just too surrealistic.
For LSI, this is what I've got bookmarked
[javelina.cet.middlebury.edu...]
[edited by: Marcia at 8:40 am (utc) on Feb. 17, 2004]
Do you realise how annoying it is to get notifications in and check the results just to see more insignificant nonsense from you numbers game punters saying I got this and I got that?
Be considerate to those of us who don't give a toss what you are seeing in beautiful downtown Burbank. I'm getting angry ;-{
Um, try to keep up with the program. Nobody is talking about that.
It seems the shakeup has settled down now, temporarily at least. The only lasting effect I'm seeing is that a lot of fresh piddle was introduced, and the results have degraded somewhat.
Maybe it was just introducing fresh pages before moving 64 over, but I sure hope they don't do that again anytime soon. That was genuinely scary.
Some of us must have commented too early and those comments got deleted. This was not normal fluctuation and was not anything like 64, 216, www. or anything else ever seen. It was as if most of the algo and all of the filters had been turned off.
Single IP addresses were fluctuating wildly, giving different results every time you hit the refresh button.
Beedee
You were told not to have email notifications on for this thread, as it would be a large one. It's your choice to look at this thread.
You can also find the CIRCA semantics paper salted away if you know where to look ;)
A brief and very simple summary of what, IMHO, this has to do with the forthcoming (can't come soon enough for me) update. Think of the analogy of fingerprint analysis. The analyser only looks at certain types of feature (whorls, intersections, branches, etc.) and marks their location. The analyser ignores all of the straight, uninteresting lines that every fingerprint has on it. Latent semantic indexing does the same with words: it ignores all of the straightforward words and concentrates on the words that have real meaning. The CIRCA Ontology defines the closeness of match of these words and creates a single statistical vector for each page. The Google algo uses this as a contributor to the SERPs.
The signs are that this overwhelmed the "old" part of the algo in Florida and to a greater extent in Austin. Now, in my opinion, either they have, through a process of trial and error, removed a feature from the semantic analysis or added one back into it, or they have up-weighted part of the old algo designed to bring back the micro-relevant sites. Whichever way they have done it, it has worked pretty well in some areas.
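One way to picture the "single vector per page" idea, as a rough sketch only: drop very common words, count what is left, and compare pages by cosine similarity. Real latent semantic indexing goes further (it factors a term-document matrix with singular value decomposition), and nothing here reflects how Google or CIRCA actually work. The stopword list, function names, and sample sentences are assumptions for illustration.

```python
import math
import re
from collections import Counter

# Assumed, tiny stopword list; a real system would use a far larger one.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on"}

def content_vector(text):
    """Count the words that remain after dropping stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)

def cosine(u, v):
    """Cosine similarity of two sparse count vectors (0.0 if either is empty)."""
    dot = sum(count * v[word] for word, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

a = content_vector("the security advisory for the mail server")
b = content_vector("a mail server security advisory")
c = content_vector("cheap flights to Burbank")
# Pages a and b share all their content words; a and c share none.
print(cosine(a, b) > cosine(a, c))
```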
I'm becoming convinced that the same technology is spotting dupes. If two pages have the same vector, they are the same. Since latent semantic indexing aims to throw out things that don't help it to compare a group of documents, I guess that the first thing it would throw out is duplicates. Too bad for folks on servers that serve up the same pages on www and non-www versions of their domains. I think that this explains the otherwise unexplained complete drop from the SERPs of previously high-ranking pages since the Florida update and possibly before.
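The "same vector means same page" guess can be sketched as follows. This is a toy under stated assumptions, not a description of Google's duplicate handling: each page is reduced to a bag of word counts, and URLs whose bags are identical are grouped together. The URLs, page text, and the `group_duplicates` name are hypothetical.

```python
import re
from collections import Counter, defaultdict

def group_duplicates(pages):
    """Group URLs whose word-count vectors are identical.

    `pages` maps URL -> page text. Returns only groups with 2+ URLs.
    """
    groups = defaultdict(list)
    for url, text in pages.items():
        counts = Counter(re.findall(r"[a-z]+", text.lower()))
        key = frozenset(counts.items())   # hashable form of the vector
        groups[key].append(url)
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]

pages = {
    "http://example.com/": "Blue widgets for sale",
    "http://www.example.com/": "Blue widgets for sale",
    "http://www.example.com/about": "About our widget shop",
}
print(group_duplicates(pages))
```

In this sketch the www and non-www home pages land in one group, which is the situation the post suspects of causing dropped pages.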
The Brandy update adds in or takes out a minor ingredient but LSI/CIRCA is a big part of the recipe.
Best wishes
Sid
<Too bad for folks on servers that serve up the same pages on www and non-www versions of their domains.>
Unfortunately, mine was one of these sites that lost all ranking, but I have just installed a 301 redirect and hopefully this will get me out of jail. Has anyone who suffered a similar fate as a result of Austin/Brandy recovered yet? If so, was it done through a 301, and how long did it take?
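For anyone unfamiliar with the fix mentioned here, this is roughly the decision a www/non-www 301 rule makes, sketched in Python rather than in any particular server's configuration syntax. The canonical host `www.example.com` and the `canonical_redirect` name are placeholders, and the sketch says nothing about whether or how quickly rankings recover.

```python
CANONICAL_HOST = "www.example.com"   # placeholder; use your own preferred host

def canonical_redirect(host, path):
    """Return (status, location) for a request, or None if no redirect is needed.

    Requests for the bare domain are answered with a permanent (301)
    redirect to the same path on the canonical host.
    """
    host = host.lower().split(":")[0]          # ignore any port in the Host header
    bare = CANONICAL_HOST[len("www."):]        # "example.com"
    if host == bare:
        return 301, "http://" + CANONICAL_HOST + path
    return None

print(canonical_redirect("example.com", "/widgets.html"))
print(canonical_redirect("www.example.com", "/widgets.html"))
```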