

Latent Semantic Indexing


marin

12:16 am on Jan 11, 2004 (gmt 0)

10+ Year Member



Please don't kill me for reminding you of the semantic approach, but I definitely believe we must consider the LATENT SEMANTIC INDEXING [webmasterwoman.com] theory; it puts together things like semantics and stemming, and explains why our singular keywords are not affected.

Note: the original URL is no longer online, but I've edited the link to point to a reprint of the original

[edited by: tedster at 7:11 am (utc) on May 10, 2007]

Hissingsid

10:31 pm on Feb 17, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hi,

Let's not point folks in the wrong direction. Why would we assume that they would try to analyse all 4.25 billion pages in one go using the new bits of the algo? There's no logic to it.

It's much more sensible to first find a relatively closely matching subset and analyse that.

Best wishes

Sid

valeyard

12:36 am on Feb 18, 2004 (gmt 0)

10+ Year Member



Gut feeling tells me that the smaller you make the sample on which you apply LSI et al., the less valuable it is and the more prone it is to false results. Was Austin an attempt to do semantic indexing "on the cheap"?

It all reminds me a bit of the Black-Scholes options pricing fiasco.

Net_Wizard

2:16 am on Feb 18, 2004 (gmt 0)



If LSI was indeed applied in Austin, then the concept is not ready for prime time, because Google's semantics are way off target, missing words that are closely related to the original query.

This alone probably resulted in the deranking of a lot of sites/pages.

Hissingsid

7:49 am on Feb 18, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Gut feeling tells me that the smaller you make the sample on which you apply LSI et al., the less valuable it is and the more prone it is to false results. Was Austin an attempt to do semantic indexing "on the cheap"?

Hi,

I disagree with this. For any search, excluding semantic analysis, there are at most a few thousand meaningful pages. If you take the top slice of these, i.e. the most meaningful, and apply a semantic analysis to them, then the results are likely to be very accurate.

I think that the problem with many terms was what got thrown out to leave the "latent" part, and what was left in as a result. In my own niche, by throwing out the actual term searched for, you were left with pages without synonyms or close semantic matches, so the indexing was based on fairly weak semantic matches. Brandy seems to allow the actual term to remain and be used in the indexing for terms where synonyms and stems are rarely used by web designers.

Best wishes

Sid

valeyard

8:34 am on Feb 18, 2004 (gmt 0)

10+ Year Member



Sid,

My understanding is that a big advantage of LSI et al. is that as well as simply comparing different documents, it can also unearth ones you might have missed: finding the buried treasure in the mounds of crud, something that is key to a good search engine. By initially restricting the set on which you're doing LSI, you have already thrown away many of these hidden treasures.

In addition, you have thrown away the majority of the datasets for comparison, making it more difficult to spot which pages are unusually similar to each other or unusually relevant.
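
For anyone who wants to see the "buried treasure" effect concretely, here's a toy sketch of textbook LSI in Python (numpy, my own illustration; nobody outside Google knows what, if anything, they actually run):

    # Toy LSI: SVD on a term-document matrix (the textbook method).
    import numpy as np

    docs = ["car engine wheel",    # 0
            "car wheel",           # 1
            "auto engine wheel",   # 2: no literal "car"
            "auto engine",         # 3: no literal "car"
            "flower petal",        # 4: unrelated
            "flower garden petal"] # 5: unrelated
    vocab = sorted({w for d in docs for w in d.split()})

    # Term-document matrix A: rows = terms, columns = documents.
    A = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in d.split():
            A[vocab.index(w), j] += 1

    # Truncated SVD: keep only k latent dimensions.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = 2

    # Fold the query "car" into the latent space: q_hat = q U_k / s_k.
    q = np.zeros(len(vocab))
    q[vocab.index("car")] = 1.0
    q_hat = (q @ U[:, :k]) / s[:k]

    # Rank documents by cosine similarity in the latent space.
    for j in range(len(docs)):
        v = Vt[:k, j]
        cos = (v @ q_hat) / (np.linalg.norm(v) * np.linalg.norm(q_hat) + 1e-12)
        print(f"doc {j}: {cos:+.2f}")

Documents 2 and 3 score on a par with the pages that literally contain "car", and the flower pages score about zero: that's the hidden treasure. Shrink the collection and the co-occurrence evidence that produces it gets thinner.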

By simplifying the process to workable levels, one loses much of the subtlety and becomes vulnerable to the hugely negative effects of unforeseen factors, such as a particular spammer/technique.

I totally agree that Brandy is giving weight back to on-page search terms. I've speculated a few times that Austin foolishly devalued TITLE and Hx tags; Brandy seems to be fixing this.

Hissingsid

8:56 am on Feb 18, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



My understanding is that a big advantage of LSI et al. is that as well as simply comparing different documents, it can also unearth ones you might have missed: finding the buried treasure in the mounds of crud, something that is key to a good search engine. By initially restricting the set on which you're doing LSI, you have already thrown away many of these hidden treasures.

In a perfect world, yes, but we are talking about a ranking search engine in which it's rare for users to look past #40 in the SERPs. So if you started with a subset of, say, 5,000 documents that were all about a subject and applied it to those (and you get your loadings right), then you would have a very high probability of presenting the user with very good results. What is below #40 is almost completely irrelevant. I don't think Google is looking for perfection, just to be better.
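
To make the two-stage idea concrete, here's a rough Python sketch; keyword_score and semantic_score are crude stand-ins I've invented, not anything Google has published:

    # Stage 1: cheap keyword filter over everything.
    # Stage 2: expensive semantic re-rank over the survivors only.
    def keyword_score(query, doc):
        # Stand-in for the "old algo": raw term-match count.
        return sum(doc.split().count(t) for t in query.split())

    def semantic_score(query, doc):
        # Stand-in for the semantic stage: credit associations as well
        # as the literal term (a real system might use LSI here).
        related = {"widgets": {"widget", "gadgets", "doohickeys"}}
        expanded = set(query.split())
        for t in query.split():
            expanded |= related.get(t, set())
        return len(expanded & set(doc.split()))

    def rank(query, index, prefilter_n=5000, serp_n=40):
        top = sorted(index, key=lambda d: keyword_score(query, d),
                     reverse=True)[:prefilter_n]
        return sorted(top, key=lambda d: semantic_score(query, d),
                      reverse=True)[:serp_n]

    pages = ["widgets widgets widgets",        # stuffed
             "quality widgets and gadgets",    # natural associations
             "buy doohickeys widget widgets"]
    print(rank("widgets", pages))

The stuffed page wins stage 1 but drops in stage 2, and only the small candidate set ever pays the cost of the semantic analysis.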

Best wishes

Sid.

jaffstar

9:53 am on Feb 18, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Before we index our document collection, we first have to create a stop list of words to ignore. This list can range from the very simple to the very elaborate. A good starting point is the 100 most frequent words in the document collection, in all their inflected forms.

If we were targeting a keyword on our site, e.g. "widgets", and we have a keyword density of 8%, then surely, according to LSI, this keyword would be ignored?
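
To see the mechanics, here is a toy version of the stop-list recipe quoted above (my own illustration, with N=2 instead of 100):

    # Ignore the N most frequent words across the document collection.
    from collections import Counter

    collection = ["the widgets are on the table",
                  "the blue widgets and the red widgets",
                  "widgets and more widgets for sale"]

    counts = Counter(w for doc in collection for w in doc.split())
    stop_list = {w for w, _ in counts.most_common(2)}
    print(stop_list)  # e.g. {'widgets', 'the'}

A keyword at 8% density across a collection can easily be frequent enough to land on the stop list, which is exactly the worry.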

Edit/found the answer.

Hissingsid

10:39 am on Feb 18, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If we were targeting a keyword on our site, e.g. "widgets", and we have a keyword density of 8%, then surely, according to LSI, this keyword would be ignored?

Of course we don't actually know how Google has implemented this type of technology.

It could be argued that since (big assumption) they probably select the sample using something approximating the old Google algo, it is safe to assume that the actual search term would be widely used on every page in the set, or on links pointing to those pages. Since that term would be ubiquitous, and therefore of little value in semantic indexing, it could be discarded. I think that this is what happened at Florida, and it was applied to a larger range of terms at the Austin update. For some searches this threw up serious anomalies, place-name service searches for example. If you take out the place name and the service from the search, how the heck is the system supposed to discriminate what the page is about? I think that what we are seeing with Brandy is a downgrading of the weight of the semantic indexing component of the algo for terms where it just does not work with the data that's out there.
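
The discarding argument is really just standard IDF arithmetic; here's a toy sketch (my illustration, no claim about Google's internals):

    # Inside a candidate set chosen *because* every page matches the
    # query, the query terms stop discriminating between candidates.
    import math

    candidates = ["placename widget service cheap",
                  "placename widget service reviews",
                  "placename widget repair map"]

    def idf(term, docs):
        df = sum(term in doc.split() for doc in docs)
        return math.log(len(docs) / df) if df else 0.0

    for term in ["placename", "widget", "cheap"]:
        print(term, round(idf(term, candidates), 2))
    # placename 0.0, widget 0.0: no weight within the set, discardable
    # cheap 1.1: still informative

Throw out both the place name and the service, though, and almost nothing is left to say what the page is about.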

From reports that I've seen re city real estate searches etc., the change to the algo has been done by hand, and they are still applying full-strength semantic indexing to some towns and cities. I would urge anyone affected by this to write to webmaster at google.com with "updatebrandy" in the subject and at the top of the message. Explain in detail the terms that are causing a problem and your observations. I've sensed a change in attitude at Google, and I think that these reports are now taken seriously and are looked at by humans.

Best wishes

Sid

valeyard

10:44 am on Feb 18, 2004 (gmt 0)

10+ Year Member



Sid,

...if you started with a subset of, say, 5,000 documents...

Then you would have thrown away approximately 99.9999% of available data. Hardly a statistic to inspire confidence.

Imagine a dataset of a thousand points. A cluster of 100 near-identical spammy doorway pages will be obvious. Now imagine a dataset of 101 points. A cluster of 100 points will look "normal".
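
Just to spell out the arithmetic of that:

    # Same 100 near-duplicate doorway pages, two different sample sizes.
    cluster = 100
    for sample in (1000, 101):
        share = cluster / sample
        label = ("stands out as anomalous" if share < 0.5
                 else "looks like the norm")
        print(f"sample={sample}: cluster = {share:.0%} of the data, {label}")
    # sample=1000: cluster = 10% of the data, stands out as anomalous
    # sample=101: cluster = 99% of the data, looks like the norm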

Don't get me wrong, I'm agreeing with you. I think that this is exactly what Google did in Austin and explains the dreadful results.

My current thoughts - they'll probably change by tomorrow :-)

- Applying sophisticated semantic analysis alone to the entire Google database for every query would probably give high quality results. Unfortunately it's impractical.

- So Google tried using simpler semantic analysis on a small subset of the database.

- The SERPs produced were at best unstable and at worst useless.

- So in Brandy they reapplied weightings for TITLE and Hx tags etc as a "sanity check".

As always, I'm just bouncing ideas around.

merlin30

11:21 am on Feb 18, 2004 (gmt 0)

10+ Year Member



"Brandy is giving weight back to on-page search terms"

This assumes that Brandy actually exists. There is absolutely no sign that the index spotted on those 64.* addresses is making its way to the servers that serve www queries. I suspect that what has been christened Brandy is a red herring.

Hissingsid

11:30 am on Feb 18, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Then you would have thrown away approximately 99.9999% of available data. Hardly a statistic to inspire confidence.

What they've actually done in this example is to pick the very best 0.0001%, the very pinnacle of excellence ;) and then applied CIRCA to that.

Do you know how many tons of rock needs to be thrown away to extract 1 gram of gold?

Best wishes

Sid

shaadi

1:03 pm on Feb 18, 2004 (gmt 0)

10+ Year Member



what is CIRCA?

Liane

1:08 pm on Feb 18, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



what is CIRCA?

On or around a certain fixed date.

caveman

1:49 pm on Feb 18, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thank you Liane ;-)

JayC

1:55 pm on Feb 18, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>> what is CIRCA?

Conceptual Information Retrieval and Communication Architecture. Essentially, Applied Semantics' software platform implementing LSI.

Hissingsid

1:58 pm on Feb 18, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



CIRCA is the technology Google bought when it acquired Applied Semantics Inc. If you do a search for CIRCA, and perhaps add in the word semantics, you should find the paper on it somewhere.

It sounds very much like CIRCA is a form of latent semantic indexing coupled with a sophisticated, but unfortunately American-English-biased, ontology. An ontology is a kind of super digital dictionary in which the interconnectivity and associations of words are described by objective measurement.
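
If you want to picture it, here is a toy guess at the shape of such an ontology; Applied Semantics never published their data format, so this is pure illustration:

    # A graph of weighted word associations, broader than a synonym list.
    ontology = {
        "car":    {"automobile": 0.9, "engine": 0.6, "tire": 0.5},
        "engine": {"motor": 0.9, "car": 0.6},
        "brandy": {"cognac": 0.9, "spirit": 0.8, "drink": 0.6},
    }

    def association(w1, w2):
        # Symmetric lookup; 0.0 means "no recorded association".
        return max(ontology.get(w1, {}).get(w2, 0.0),
                   ontology.get(w2, {}).get(w1, 0.0))

    print(association("car", "tire"))   # 0.5: associated, not a synonym
    print(association("car", "donut"))  # 0.0: unrelated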

I hope that this helps.

Best wishes

Sid

Liane

2:02 pm on Feb 18, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Oooopps ... doh! :o

barbeta

2:05 pm on Feb 18, 2004 (gmt 0)

10+ Year Member



Let's see what happens with the other languages...

SyntheticUpper

2:12 pm on Feb 18, 2004 (gmt 0)

10+ Year Member



If you do a search for CIRCA, and perhaps add in the word semantics, you should find the paper on it somewhere.

Or alternatively, add in the word 'moose', 'aardvark', or 'donut' :)

Hissingsid

2:42 pm on Feb 18, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



:o) LOL

I tried aardvark and it doesn't work.

Sid

landmark

3:58 pm on Feb 18, 2004 (gmt 0)

10+ Year Member



This argument is very persuasive ... until I apply it to the facts.

Each page of my example site is dedicated to manufacturer_x product_x (i.e. a different product on each page). Florida had no effect on the site. Austin killed half of the pages, but the other half survived and are still ranked top 10.

So I am looking for an explanation for why some keywords were affected, whilst others are ranked as before.

I searched for:
~manufacturer_x ~product_x

This reveals the synonyms for each. Guess what? There are 3 synonyms and they are the same for all my sample keywords - both the keywords that survived Austin and the keywords that were killed.

If Sid's hypothesis were true, then I would have expected there to be no synonyms for the keywords that were unaffected by Austin, because that would explain why the old algo still applies to these keywords. At the very least I would have expected to see different synonyms.

One thing more - my pages contain plenty of the synonyms. They are natural language pages, not keyword-stuffed.

Net_Wizard

4:28 pm on Feb 18, 2004 (gmt 0)



That's what I posted before: Google's semantics are so far off target that content pages like yours, with naturally related words, can get deranked.

Most content pages are written to be very specific in terms of theme and subject, whereas this LSI algo covers a very broad scope in determining word relationships, to the detriment of specific content pages.

It's like associating cars with nuts and bolts: still related, but far too distantly related, while tire or steering wheel is ignored.

[edited by: Net_Wizard at 4:29 pm (utc) on Feb. 18, 2004]

Hissingsid

4:28 pm on Feb 18, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hi Landmark,

I've read your post three times and I still don't really know what you are saying.

Are you saying that the same three synonyms occur on the pages that were dropped as on the pages that were not dropped? There were certainly other things that happened at Austin besides changes in the semantics algo, which may account for your dropped pages.

I've been privileged to see a number of examples where people who assumed that semantics is the way to go have slowly risen up the SERPs as a result of their take on the implementation of this. The CIRCA ontology is not just about synonyms; it's about word associations.

Best wishes

Sid

Hissingsid

4:33 pm on Feb 18, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hi Net_Wizard,

Spot on, plus it's an American English ontology.

In one of the Brandy threads, GoogleGuy confirmed that they are using semantic indexing and that they have now found a better way to implement it.

An understanding of LSI, I think, helps you to understand the processes, a variant of which is now being used for part of the algo.

Best wishes

Sid

marin

4:57 pm on Feb 18, 2004 (gmt 0)

10+ Year Member



<The CIRCA ontology is not just about synonyms; it's about word associations.>

All words have meaning, and different words have various degrees of similarity and distance of meaning.
Very good paper here: [semioticon.com...]

<Let's see what happens with the other languages...>
This is about cross-language document retrieval [lsi.argreenhouse.com...]

Queries in one language can retrieve documents in other languages (as well as the original language). This is accomplished by a method that automatically constructs a multilingual semantic space using Latent Semantic Indexing.

Hi Sid

I used your trick ~widgets -widgets in Romanian (my language). Surprise: it does not work; it always returns zero results.

Maybe Google created just small sets of dual-language training documents for non-English languages.
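
The paper's method, boiled down to a toy (the corpus here is mine): each training column is a dual-language document, English plus its Romanian translation in one bag of words, so the SVD places translation pairs near each other in the latent space.

    import numpy as np

    train = ["dog caine animal",       # "caine" = dog
             "dog bark caine latra",   # "latra" = barks
             "house casa door usa"]    # "casa" = house, "usa" = door
    vocab = sorted({w for d in train for w in d.split()})

    A = np.zeros((len(vocab), len(train)))
    for j, d in enumerate(train):
        for w in d.split():
            A[vocab.index(w), j] += 1

    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = 2

    def fold(text):
        # Project a bag of words into the shared latent space.
        q = np.zeros(len(vocab))
        for w in text.split():
            q[vocab.index(w)] += 1.0
        return (q @ U[:, :k]) / s[:k]

    en, ro = fold("dog"), fold("caine")
    print(en @ ro / (np.linalg.norm(en) * np.linalg.norm(ro)))  # ~1.0

With only a handful of training pairs for a language, coverage is thin: a word that never appears in the training set has no latent position at all, which would look exactly like the zero results I'm seeing.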

<I hadn't made any other changes for months other than changing my new articles each month on the upper left hand side. I have been sitting on #11 in the serps when searching the word 'widgeting' for months>

annej, LSI is able to learn: see here: [patentsearch.patentcafe.com...]

landmark

5:02 pm on Feb 18, 2004 (gmt 0)

10+ Year Member



Sid said: I've read your post three times and I still don't really know what you are saying.

LOL. I'll try to make it clearer. (It's frustrating trying to give examples without being allowed to say what keywords I'm actually talking about!)

Landmark said: I am looking for an explanation for why some keywords were affected, whilst others are ranked as before.

In your msg #73 you seemed to be proposing an explanation for this, i.e. that Google's semantic dictionary was expanded at Austin. My point was that I can't relate this to the set of search terms that I examined using Google synonyms.

There must be a reason why certain search terms were selected at Florida and Austin.

Maybe your theory is correct, but I don't think it can explain how the list of search terms was selected. Maybe that was selected on other grounds.

rrl

5:10 pm on Feb 18, 2004 (gmt 0)

10+ Year Member



The discussion about discarding terms that appear on every page had me intrigued, so I combined a couple of sites out of curiosity. I didn't do a thing to either site other than combine them and do some linking to avoid orphan pages. Both sites popped up #1 for their terms again and the terms are competitive.

landmark

5:18 pm on Feb 18, 2004 (gmt 0)

10+ Year Member



You know, I really want to believe this theory. I'm not trying to be negative. A good theory can stand up to criticism - it's what distinguishes a good theory from a bad theory.

One thing that a lot of people noticed after Florida & Austin was that you would search for "new york widgets" and up would come a page on an authority website that mentioned, "Jo Mo from New York told Sarah Widgets that ...". Such a page is semantically distant from the real topic and should have been penalised by LSI.

How do the supporters of LSI counter this?

metrostang

5:56 pm on Feb 18, 2004 (gmt 0)

10+ Year Member



>>>Ref. Landmark's New York Widgets example.

I too have some good examples of this. After reading the last 4 posts, I went back and had a closer look. For one two-keyword phrase, two of the top 5 sites were good examples of this at work. Upon closer examination I found unintended use of some synonyms and related terms for one of the keywords, although the real topic of the page was in another field. I also found one site that was a directory, not of sites but of books related to one of the keywords, but not the other; two of the authors' last names were the other keyword.

After adding a third keyword that really narrowed the field, these sites were gone. It appears the new algo is treating each keyword as a single search and then combining the results. In some areas this works, and in others, due to the different meanings and usages of many words, you get unintended results.
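
A toy sketch of that reading (my illustration of the effect, not Google's code):

    docs = {
        "authority-page": "jo mo from new york told sarah widgets that",
        "widget-shop":    "new york widgets for sale by new york widgets co",
    }

    def hits(term):
        # One "search" per keyword: every page containing the term.
        return {name for name, text in docs.items() if term in text.split()}

    # Combining per-term result sets lets the off-topic page qualify:
    # it contains both words, just never as one topic.
    print(hits("york") & hits("widgets"))
    # Add a third, narrower term and the accidental match drops out:
    print(hits("york") & hits("widgets") & hits("sale"))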

Hissingsid

9:03 pm on Feb 18, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



There must be a reason why certain search terms were selected at Florida and Austin.

The terms selected for Florida were the top money terms in the AdWords suggestion tool. Probably a coincidence, but that's the way it looks. For example, my $1 term got hit in Florida; in Austin, my 50c three-word terms and secondary two-word terms got hit. It's like the top of the pyramid got selected first, and then they moved down a layer at Austin.

I still think that they are using the same technology as part of the algo in Brandy (what the hell has happened to Brandy?), but they seem to have added back in some ingredient that brings back the pages that got dropped but which should not really have been. Thank goodness! How they have arrived at which terms to turn the dial back on, I don't have a clue; I'm just glad I sent GoogleGuy a detailed explanation of the problems I was having and some incredible anomalies. Like on WWW now (still waiting for Brandy to come across here in the UK): if I search for modular widget finance I'm at about #50, but if I search for modular finance I'm at #1. This is repeated time and again for three-word terms including the word widget. Now, whatever was turning the word widget into a poison word or a discard word has been removed for this particular term.

I didn't mean any offence by the way. I just couldn't visualise what you were trying to say.

This hypothesis I'm peddling has a few very good virtues if you implement it. It's a bit like going to bed early, or not drinking or smoking: it certainly won't do you any harm, and if it's right it will do you a lot of good. Win<>Win

Best wishes

Sid
