Forum Moderators: open
"Any clue as to the possible role greater reliance on semantics is playing in your never ending quest for more relevant results?"
I'd say that's inevitable over time. The goal of a good search engine should be both to understand what a document is really about, and to understand (from a very short query) what a user really wants. And then match those things as well as possible. :) Better semantic understanding helps with both those prerequisites and makes the matching easier.
So a good example is stemming. Stemming is basically SEO-neutral, because spammers can create doorway pages with word variants almost as easily as they can optimize for a single phrase (maybe it's a bit harder to fake realistic doorways now, come to think of it). But webmasters who never think about search engines don't bother to include word variants--they just write whatever natural text they would normally write. Stemming allows us to pull in more good documents that are near-matches. The example I like is [cert advisory]. We can give more weight to www.cert.org/advisories/ because the page has both "advisory" and "advisories" on the page, and "advisories" in the url. Standard stemming isn't necessarily a win for quality, so we took a while and found a way to do it better.
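To make the stemming idea concrete, here's a toy sketch (my own illustration, not Google's actual stemmer - real stemmers like Porter's are far more careful). A crude suffix rule maps word variants to a shared stem, so a query for "advisory" can also credit a page containing "advisories":

```python
# Toy stemming sketch: map word variants to a shared stem so that
# near-match documents score against the query. Illustration only.

def crude_stem(word):
    """Very rough English suffix stripping, for demonstration only."""
    word = word.lower()
    if word.endswith("ies"):
        return word[:-3] + "y"   # "advisories" -> "advisory"
    if word.endswith("es"):
        return word[:-2]
    if word.endswith("s"):
        return word[:-1]
    return word

def match_score(query, page_words):
    """Count page words whose stem matches any query-term stem."""
    query_stems = {crude_stem(w) for w in query.split()}
    return sum(1 for w in page_words if crude_stem(w) in query_stems)

# Both the singular and plural forms count toward the query.
page = ["cert", "advisory", "advisories", "archive"]
print(match_score("cert advisory", page))  # -> 3
```

Without stemming, only the exact tokens "cert" and "advisory" would match; with it, "advisories" contributes too, which is the [cert advisory] example above in miniature.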
So yes, I think semantics and document/query understanding will be more important in the future. pavlin, I hope that partly answers the second of the two questions that you posted way up near the start of this thread. If not, please ask it again in case I didn't understand it correctly the first time. :)
http://www.google.com/contact/spamreport.html
or write webmaster(at)google.com
As long as 64. is staying stable I'm OK to wait, the others are just too surrealistic.
For LSI, this is what I've got bookmarked
[javelina.cet.middlebury.edu...]
[edited by: Marcia at 8:40 am (utc) on Feb. 17, 2004]
Do you realise how annoying it is to get notifications in and check the results just to see more insignificant nonsense from you numbers game punters saying I got this and I got that?
Be considerate to those of us who don't give a toss what you are seeing in beautiful downtown Burbank. I'm getting angry ;-{
Um, try to keep up with the program. Nobody is talking about that.
It seems the shakeup has settled down now, temporarily at least. The only lasting effect I'm seeing is that a lot of fresh piddle was introduced, and the results have degraded somewhat.
Maybe it was just introducing fresh pages before moving 64 over, but I sure hope they don't do that again anytime soon. That was genuinely scary.
Some of us must have commented too early and those comments got deleted. This was not normal fluctuation and was not anything like 64, 216, www. or anything else ever seen. It was as if most of the algo and all of the filters had been turned off.
Single IP addresses were fluctuating wildly, giving different results every time you hit the refresh button.
Beedee
you were told "not" to have email notifications on this thread as it would be a large one; it's your choice to look at this thread
You can also find the CIRCA semantics paper salted away if you know where to look ;)
A brief and very simple summary of what I think (IMHO) this has to do with this forthcoming update (which can't come soon enough for me). Think of the analogy of fingerprint analysis. The analyser only looks at certain types of feature - whorls, intersections, branches etc. - and marks their location. The analyser ignores all of the straight uninteresting lines that every fingerprint has on it. Latent semantic indexing does the same with words: it ignores all of the straightforward words and concentrates on the words that have real meaning. The CIRCA Ontology defines the closeness of match of these words and creates a single statistical vector for each page. The Google algo uses this as a contributor to the SERPs.
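For anyone who wants to see the mechanics, here's a minimal LSI sketch (my own illustration - no claim that this matches what Google or CIRCA actually run). It builds a term-document matrix, drops the "uninteresting straight lines" (stopwords), then uses a truncated SVD to get one low-dimensional vector per document:

```python
# Minimal latent semantic indexing sketch (illustrative only).
# Stopwords are discarded, then a truncated SVD projects each
# document to a single low-dimensional "semantic" vector.
import numpy as np

STOPWORDS = {"the", "a", "of", "and", "to", "is"}

docs = [
    "the care and feeding of ducks",
    "ducks and geese of the pond",
    "the history of steam engines",
]

vocab = sorted({w for d in docs for w in d.split() if w not in STOPWORDS})
A = np.array([[d.split().count(w) for d in docs] for w in vocab], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                        # keep the strongest dimensions
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T    # one k-dim vector per document

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# The two duck documents land closer together in the reduced space
# than either does to the steam-engine document.
print(cos(doc_vectors[0], doc_vectors[1]) > cos(doc_vectors[0], doc_vectors[2]))
```

The point of the truncation is exactly the fingerprint analogy: only the strongest patterns of word co-occurrence survive into the vector, and everything else is discarded.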
The signs are that this overwhelmed the "old" part of the algo in Florida and, to a greater extent, in Austin. Now in my opinion either they have, through a process of trial and error, removed or added back an extra feature in the semantic analysis, or they have up-weighted the part of the old algo designed to bring back the micro-relevant sites. Whichever way they have done it, it has worked pretty well in some areas.
I'm becoming convinced that the same technology is spotting dupes. If two pages have the same vector, they are the same. Since latent semantic indexing aims to throw out things that don't help it to compare a group of documents, I guess that the first thing it would throw out is duplicates. Too bad for folks on servers that serve up the same pages on www and non-www versions of their domains. I think that this explains the unexplained complete drop from SERPs of previously high-ranking pages since the Florida update, and possibly before.
The Brandy update adds in or takes out a minor ingredient but LSI/CIRCA is a big part of the recipe.
Best wishes
Sid
<Too bad for folks on servers that serve up the same pages on www and non-www versions of their domains.>
Unfortunately I was one of these sites that lost all ranking, but I have just installed a 301 redirect and hopefully this will get me out of jail. Has anyone who suffered a similar fate as a result of Austin/Brandy recovered yet? If so, was it done through a 301, and how long did it take?
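For anyone else in the same boat, one common way to do the non-www to www 301 looks like this (hedged example - this assumes an Apache server with mod_rewrite enabled, and "example.com" stands in for your own domain):

```apache
# Redirect non-www requests to the www version with a permanent 301.
# Assumes Apache with mod_rewrite; swap in your own domain.
RewriteEngine On
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
```

Other server setups need a different mechanism, but the goal is the same: only one canonical version of each page answers with a 200.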
Are there just 2 of them: 233.161.104 and 233.161.99?
I'm about as sure as I am that the World is not flat and NASA filmed the Lunar landings in the Nevada desert ;)
Oh, and Googleguy confirmed that, to paraphrase, "they have found a better way of doing semantic indexing". If it walks like a duck, quacks like a duck, and the best ornithologist you know says it's a duck, I think it's safe to assume that it's a duck. Now we know it's a duck, we can assume that it likes splashing about in ponds, the rain, quacking outrageously at duck jokes, etc.
If they are not using LSI to spot dupes what technology do you think they are using?
Best wishes
Sid
edit reason: This CGI is screwing up my posts again
To spot a dupe, both pages would have to show up in the exact same vector position. A single additional token word recognised by the semantic indexing would move the dupe site to a different position in the vector space. Also, with pages containing a very small number of token words, it's not inconceivable that two totally different pages might occupy the same position in the vector space. Just my 2 cents worth, but I'm not sure LSI could easily be used for dupe content spotting.
It also seems to me in some cases that the surviving page gets a boost in the rankings from eliminated pages with duplicate content that link to it. I would guess this would only be the case if the pages are not seen as affiliated.
Does this fit with anything anyone else is seeing? - Sid?
My results on 64 are great - but on google.ca they are even better still!
I'm thinking that maybe Canada has the 64 results, but with the benefit of backlinks added or something - it's been like that consistently for the last couple of days.
Google has put too much weight on page linking. Just because so many webmasters have purchased links on high-PR sites to push their rankings higher does not mean they have a high-quality site. A good-quality site should have nothing to do with who links to you.
*Sigh* The Same Old Delusion returns. Sometimes, I hate new users.
So you think it's unfair to use a system that takes into account multiple opinions about your site, and that it would be more fair to switch to a system that only uses one opinion of your site? Because that's what you get if you throw away citation analysis: An engine from the bad old days, when everything depended on The Secret Algorithm, and we had absolutely no chance of recognizing or resisting arbitrary filters. Anonymous programmers decided what was important to everyone.
It's truly frightening how many webmasters cry out for a return to search engine dictatorship whenever democracy fails to give them what they want.
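For anyone unfamiliar with how citation analysis actually works mechanically, here's a toy PageRank-style power iteration (an illustrative sketch only, not Google's production algorithm - the real system has many more signals and refinements):

```python
# Toy PageRank-style citation analysis by power iteration.
# Each page's rank is built from the ranks of pages linking to it -
# the "multiple opinions" idea in miniature. Illustration only.

def pagerank(links, damping=0.85, iters=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - damping) / n for p in pages}
        for p, outs in links.items():
            if outs:
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new[q] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank

graph = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
# "c" is cited by both "b" and "d", so it ends up ranked highest.
print(max(ranks, key=ranks.get))  # -> c
```

The contrast with the "Secret Algorithm" days is that the inputs here are other sites' links - opinions anyone can observe - rather than weights chosen entirely behind closed doors.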
It's not rolled in yet at least not entirely
Does this fit with anything anyone else is seeing? - Sid?
Not sure.
Just to clarify something. Naive Bayes = simple page semantic analysis. Things like spam filters on email progs.
Latent semantic indexing = much more accurate.
CIRCA = several orders more accurate than LSI because of its huge Ontology
CIRCA + Google = killer solution. Add what Google knows about pages to what CIRCA senses about pages and linked pages and you should have a very accurate system for SERPs and spotting dupes.
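To make the Naive Bayes end of that spectrum concrete, here's a toy classifier in the spirit of email spam filters (my own illustration - real filters add better tokenization, priors, and training at a much larger scale):

```python
# Toy Naive Bayes text classifier, in the spirit of email spam filters.
# Each word contributes an independent log-probability vote. Illustration only.
from collections import Counter
from math import log

spam = ["buy cheap pills now", "cheap pills cheap deals"]
ham = ["meeting notes for tuesday", "notes on the pills study"]

def train(docs):
    counts = Counter(w for d in docs for w in d.split())
    return counts, sum(counts.values())

spam_counts, spam_total = train(spam)
ham_counts, ham_total = train(ham)
vocab = set(spam_counts) | set(ham_counts)

def log_prob(word, counts, total):
    # Laplace smoothing so an unseen word doesn't zero out the score.
    return log((counts[word] + 1) / (total + len(vocab)))

def classify(text):
    words = text.split()
    spam_score = sum(log_prob(w, spam_counts, spam_total) for w in words)
    ham_score = sum(log_prob(w, ham_counts, ham_total) for w in words)
    return "spam" if spam_score > ham_score else "ham"

print(classify("cheap pills"))  # -> spam
```

Note how crude this is compared with LSI: every word votes independently, with no notion of which words carry real meaning or how they relate - which is exactly the gap the more semantic approaches are meant to close.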
Re dupes: it's not just one measure that's used. In fact it could be a cascade: if 95%-plus certain of a dupe, then cross-reference other algo components.
LSI is like an evolutionary step on the way towards what Google is (in part) implementing now as PART of its algo. If you understand something about LSI then you start to understand what is going on in SERPs.
Many of the papers on LSI and similar analysis methods talk about the use of training sets of data to teach the algorithm right from wrong. I wonder if this is what we are seeing now, i.e. Google/CIRCA gets to fourth grade. If that is the case, then this is the first of a much-improved implementation of the new technology, and it could get better with each update. What a shame "better" is such a subjective word, and "one man's meat is another man's poison".
Best wishes
Sid
It's truly frightening how many webmasters cry out for a return to search engine dictatorship whenever democracy fails to give them what they want.
In this democracy, those that got the vote (PR) in the last election, get to choose who wins in the next election - that's quite often how dictatorship starts.