Forum Moderators: open


Latent Semantic Indexing

         

marin

12:16 am on Jan 11, 2004 (gmt 0)

10+ Year Member



Please do not kill me for reminding you of the semantic approach, but I definitely believe we must consider the LATENT SEMANTIC INDEXING [webmasterwoman.com]
theory; this puts together things like semantics and stemming, and explains why our singular keywords are not affected

Note: the original url is no longer online, but I've
edited the link to point it to a reprint of the original

[edited by: tedster at 7:11 am (utc) on May 10, 2007]

steveb

9:37 pm on Feb 12, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



"If LSI was applied as a second step no page should come up in the top 10 with the content of the biography of president Bush."

Lots of folks have posted over the past couple months thinking that ONE thing is going on here. An algorithm is many things. A page can easily come into the top ten without having any value in one algo area, like here. Sheer volume of anchor text can overwhelm everything else. In uncompetitive areas the only page on the Internet with the query in the page title will often/usually beat pages with the query on the page. There is no one thing at play here. There are many. Some are more important than others, but the others do exist and can be the deciding factor sometimes.

rfgdxm1

10:02 pm on Feb 12, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>I think the "miserable failure" thing is a red herring... people will probably always be able to Google-bomb (or any search-engine-bomb: Altavista and MSN return Bush's biography #2 and #1 also) for a phrase like "miserable failure" that has zero websites already devoted to it. I could probably Google-bomb Bush's site for "chartreuse platypuses" with ten links tomorrow, because that phrase doesn't even exist anywhere else on the web right now.

And, I really wouldn't consider it appropriate for Google to edit SERPs based on political issues. The reason why Bush is #1 for "miserable failure" is a lot of people hate the guy. If that is their opinion, so be it, and Google shouldn't interfere. By the same token, if lots of Bush supporters link to his biography with "great leader", and it comes up #1 on Google for that search, Google shouldn't tamper with that SERP either.

rfgdxm1

10:10 pm on Feb 12, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>The tilde search thing is HOT. I just did it for a few of my keywords and it seems to give significant insight into why the top sites for my keyphrases are there...

That tilde search works oddly. I just checked it using a three-letter acronym that is a shorthand term for the generic name of a certain pharmaceutical. Searching on that acronym, the generic name is also highlighted. However, a search on the full generic name doesn't show the acronym highlighted. And, if you search "CIA" and "Central Intelligence Agency" using the tilde, in neither case is the other highlighted.

annej

12:07 am on Feb 13, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Flicker, I'm not sure if anyone mentioned this. If your hypothetical "Martian War" page is really about a war, did you use any words like 'dead' or 'tragedy'? AdSense has a list of words like this that trigger public service announcements instead of commercial links. I have two articles on my site discussing two different wars, and both showed the PSAs, so I just took AdSense off those two pages.

annej

12:23 am on Feb 13, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>>Sheer volume of anchor text can overwhelm everything else.<<

I've noticed this too. Google doesn't seem to be looking beyond the anchor text to the sites providing the inbound links, as many are still unrelated sites.

Yet other search results seem to be affected by this new algo.

flicker

12:40 am on Feb 13, 2004 (gmt 0)

10+ Year Member



annej, no, it's a good thought, but it's not literally about a war. That was just an example on my part. (-: It's an educational topic no different from many dozens of others on the site except for the name coinciding with the big city.

I don't, however, know if something was done differently with the advertising on that page somehow (typo in the ad code or something). The difference just struck me after what I saw other people saying about ~cityname -cityname failing, and so on. (-:

TheWhippinpost

2:02 am on Feb 13, 2004 (gmt 0)

10+ Year Member



Bobby #45:
It would be useful to get feedback from other webmasters' experiences with subtle linguistic changes they have implemented and how it has affected the SERPs.

annej has posted [webmasterworld.com] a result here after just a few days, though it may still be too soon to say with certainty what caused the rise.

Hissingsid:

The thing that makes me think that the term is thrown out is two fold. 1. It makes logical sense since we can assume that every page found using the old algo either has the term somewhere in the text ...

I agree; "expected" KWs are still given weight, which gets "lighter" the more common they are (until they are discarded). The distillation process still "revolves" around these words as it "scopes" out further unique words; the unique words sort of "affirm" a subject and offer related docs to the search query.

It's difficult, without the proper thinking time, to put into words, but essentially: once you have the search query from the user, you really only need to FIRST collect all the docs that contain that KW. Once that's done you can almost forget about the word, and spread your tentacles out to find the next most important "expected" word, and so on. Each step carries less weight than the last as one gets "hotter"; or rather, the subject matter, rather than the keyword "matter", becomes more relevant.

Makes sense to me, especially on keyphrase, or multi-word search queries.
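For what it's worth, that process can be sketched in a few lines of Python. Everything here is invented for illustration (the toy corpus and the TF-IDF-style weighting); nobody outside Google knows the real formula:

```python
import math
from collections import Counter

# Toy corpus: each "page" is just a list of words.
docs = [
    "widget repair guide for antique widgets".split(),
    "widget history and widget lore".split(),
    "antique clock repair guide".split(),
]

def idf(term, docs):
    """Rarer terms weigh more; ubiquitous terms tend toward zero."""
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df) if df else 0.0

def score(query_term, doc, docs):
    """First collect only the docs containing the query term; then let
    the co-occurring "expected" words carry the weight, with common
    words contributing less than rare ones."""
    if query_term not in doc:
        return 0.0
    counts = Counter(doc)
    return sum(n * idf(t, docs) for t, n in counts.items() if t != query_term)

ranked = sorted(range(len(docs)), key=lambda i: -score("widget", docs[i], docs))
```

The two pages containing the query rank ahead of the one that never mentions it, and within those, the rarer co-occurring words do most of the work.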

caveman

2:53 am on Feb 13, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Then there's the question of if/how all of the pages within a given site are related to each other in this regard...especially wrt the homepage. ;-)

258cib

3:06 am on Feb 13, 2004 (gmt 0)

10+ Year Member



steveb said
Lots of folks have posted over the past couple months thinking that ONE thing is going on here. An algorithm is many things. A page can easily come into the top ten without having any value in one algo area, like here. Sheer volume of anchor text can overwhelm everything else. In uncompetitive areas the only page on the Internet with the query in the page title will often/usually beat pages with the query on the page. There is no one thing at play here. There are many. Some are more important than others, but the others do exist and can be the deciding factor sometimes.

Exactly. And Google, and any other search engine using this, is going to have priorities. Widgets gets millions of searches, so the algorithm there is very complex. Tung is very specialized, so it's not so sophisticated.

About the only thing you can do--but it's a lot--is study what pages are getting high listing on your search terms. Google tries to create a level playing field focused on the reader. Now, how does their software view this goal? Once upon a time, it was lots of people linking to the site--and that's a factor, still. Now, it's also complete sentences, perhaps? Key phrases?

If Google is a public company, they are going to be focusing on areas that make their stockholders the most money. That's OK--commercial speech has real value in a social sense. But, those areas with a lot of traffic/high bids could have more complex algorithms (I am not implying unfair or not reader focused at all) than other, less well traveled areas.

idoc

3:57 am on Feb 13, 2004 (gmt 0)

10+ Year Member



The old algorithm seems basically intact, because searches for non-competitive and otherwise obscure terms, e.g. 'miserable failure', are still returning SERPs like before. Since the Florida update it has been widely assumed that some ontology filter was being applied to the regular result set, i.e. 'keyword +a' returns results similar to the old ones because G evidently treats this term differently from 'keyword'. Austin merely increased the token set, it seems.

What I find very interesting also is the combination of the synonym tool with the -exclusion operator. '~city -city' and '~state -state' return no results at all, meaning there are no tokens for these words, which makes sense. Try any keyword that someone would, or more likely already has, bought an AdWords ad for. Use '~keyword -keyword' and the result set will highlight synonyms. Scroll through 200-300 results, though, and you will find just 3 to 5 stemmed synonyms on average. It would appear that either this method is not telling us everything about the synonym/token list or... maybe it is, and maybe that is the problem. I don't assume this semantic factor has replaced the old algo, just that it could now determine the damping factor for the basic old algo.

steveb

7:52 am on Feb 13, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



"ie 'keyword +a' returns similiar to old results because G evidently treats this term differently than 'keyword'"

You apparently haven't done this search in the past days.

The results are nowhere near anything resembling the "old results". The results favor sites whose titles use the word "a". In other words, they are completely different from anything else.

"Since the Florida update it has been widely assumed that some ontology filter was being applied to the regular result set"

No one paying attention has assumed this. The first thing a person needs to know about Florida/Austin is that this statement is wholly untrue. This is a new algorithm, a new ranking system, not related in any key way to the old one.

Hissingsid

8:42 am on Feb 13, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Then there's the question of if/how all of the pages within a given site are related to each other in this regard...especially wrt the homepage. ;-)

Hi Caveman,

Could you expand on that thinking a bit.

Many of us noticed that pages that ranked highest and were index.html at the domain root dropped the furthest. The SERPs filled up with inner pages from directories, and from larger sites on the broad subject whose specific pages touched on the specific topic of the search.

In my SERPs, in the top 50 there are only a couple of root pages listed, and both of those are supported by an indented listing of a page on the specific topic, almost as though a general root page supported by a specific inner page carries weight.

Best wishes

Sid

Hissingsid

9:06 am on Feb 13, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



No one paying attention has assumed this. The first thing a person needs to know about Florida/Austin is that this statement is wholly untrue. This is a new algorithm, a new ranking system, not related in any key way to the old one.

Hi Steve etal,

The way I see it, an algorithm is built up of components. There are many components in the new algorithm that are the same as in the old one; really, there would have to be, because there are only so many variables on the page and in linking structures that can be analysed. The things that can be analysed remain the same; the way they are analysed has changed.

The contribution of each component can be increased or decreased, and some things can be added and others taken away. It seems clear to me that some of the key components of the old algo are still definitely there, anchor text, PR, etc., and something else has been added. CIRCA, being a special kind of LSI, is in my view the #1 candidate for this new component. Everything Google has been playing with since it acquired Applied Semantics points to this. Expert opinion suggests that LSI would be incredibly inefficient if applied to a corpus of 3.3 billion pages with regular updates, because of the complexity of the calculations required.

The smart thing to do would be to apply it to smaller samples.

So how do you get smaller samples? Well, you take result sets from Google's index.

How do you decide which results sets to create to apply the analysis to? You compile a list of the most frequently used search terms.

And what do you do when you want to expand that compiled list? You expand it to related frequently used terms. (Sound like Austin anybody?)

This addition of LSI/CIRCA to the existing algo would look different enough to appear like a whole new algo without changing what Google previously had. If you accept this two-stage process, it explains why some terms were affected and some were not, and why more were added at Austin. If they are not doing this, then they need to get the thing patented quickly and start using it, because it is killer search technology.
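To make the second step concrete, here is a minimal sketch of LSI-style analysis via truncated SVD on a tiny, hypothetical term-document matrix. This only illustrates the general LSI technique; CIRCA's actual workings are proprietary and unknown:

```python
import numpy as np

# Hypothetical term-document matrix for a small "result set" of 4 pages.
# Rows are terms, columns are documents, values are raw term counts.
terms = ["widget", "gadget", "gizmo", "recipe"]
A = np.array([
    [2.0, 1.0, 0.0, 0.0],   # widget
    [1.0, 2.0, 0.0, 0.0],   # gadget
    [0.0, 1.0, 1.0, 0.0],   # gizmo
    [0.0, 0.0, 0.0, 3.0],   # recipe
])

# LSI in brief: a truncated SVD keeps only the k strongest "concepts"
# and represents each document as coordinates in that reduced space.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
docs_k = (np.diag(s[:k]) @ Vt[:k]).T   # one row per document, k concept coords

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Documents 0 and 1 share vocabulary, so they land close together in
# concept space; document 3 is about something else entirely.
```

On a small sample like this the SVD is cheap, which is the whole argument for applying it after step 1 rather than to the full corpus.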

Best wishes

Sid

hanan_cohen

9:28 am on Feb 13, 2004 (gmt 0)

10+ Year Member



I don't remember if anyone mentioned it here, but searching for ~travel -travel does not affect the AdWords.

I think it's an important fact but I cannot find the exact reason why it is so.

Hissingsid

9:37 am on Feb 13, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I don't remember if anyone mentioned it here, but searching for ~travel -travel does not affect the AdWords.

I think it's an important fact but I cannot find the exact reason why it is so.

Hi Hannan,

AdWords can/does use broad match to choose the ads to present. Since the terms returned are the closest possible match to ~term -term, it makes sense that these will be perfect broad matches. In fact, maybe that's what they mean by broad match.

Best wishes

Sid

idoc

12:46 pm on Feb 13, 2004 (gmt 0)

10+ Year Member



Sid,

"This addition of LSI/CIRCA to the existing algo would look different enough for it to appear like a whole new algo without changing what Google previously had."

Exactly. I think a lot of folks have really made more of this than it is. It's a "candy coated" filter of sorts applied on top of the old Google. That's why everyone in academia is still enamored with Google results: they are for the most part the same. Yes, they changed a bit, as would be normal across two updates, but Google didn't throw out the old. It's also why everyone, well, practically everyone, who optimizes for commercial purposes in competitive areas is in upheaval.

Hissingsid

4:20 pm on Feb 13, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hi,

It just struck me that since LSI distils a page down to a single vector, it would make an excellent dupe filter. Two pages with exactly the same vector are the same page.

Which perhaps explains one or two other anomalies, eh?
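A toy illustration of the idea, with raw term counts standing in for a real LSI vector (the vocabulary and pages are invented):

```python
import numpy as np

VOCAB = ["widget", "repair", "guide", "cheap", "hotels"]

def doc_vector(text):
    """Collapse a page to a single term-count vector: a crude stand-in
    for the LSI vector described above."""
    words = text.lower().split()
    return np.array([words.count(t) for t in VOCAB], dtype=float)

def is_dupe(a, b, threshold=0.98):
    """Flag two pages as duplicates when their vectors nearly coincide."""
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return bool(cos >= threshold)

page1 = doc_vector("widget repair guide widget repair")
page2 = doc_vector("Widget repair guide widget repair")   # a scraped copy
page3 = doc_vector("cheap hotels guide")
```

The scraped copy collapses to the same vector and gets flagged; the unrelated page does not.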

Best wishes

Sid

zgb999

4:32 pm on Feb 13, 2004 (gmt 0)

10+ Year Member



It might be part of their dupe filter, but it would be too easy to cheat.

From the patents I have seen, it looks more as if the dupe filter takes several fingerprints from a page.

Hissingsid

5:42 pm on Feb 13, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



From the patents I have seen, it looks more as if the dupe filter takes several fingerprints from a page.

The point of LSI is that it distills all of the "fingerprints" from a document and uses them to calculate a single vector.

Best wishes

Sid

metrostang

8:14 pm on Feb 13, 2004 (gmt 0)

10+ Year Member



If there was a reference to this earlier, I missed it.

There is a thread in Keyword Discussions, started in March 2001, on stemming and keyword "families" and on the future of stemming, LSI, and the use of categories.

It's worth a read. It talks about a somewhat different approach to the same technology, and it also has links to some good articles.

[webmasterworld.com...]

annej

3:57 pm on Feb 14, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



annej has posted a result here after just a few days - maybe too soon to say with certainty what caused the rise though.

I hadn't made any other changes for months, other than changing my new articles each month on the upper left-hand side. I had been sitting at #11 in the SERPs for months on a search for the word 'widgeting', since sometime after Dominic as I remember. I crept up to #8 a few weeks ago, then to #7, and today it's at #5. I've got to suspect it's the changes I made.

All I did was include some of the related words that I found using ~widgeting. I didn't add a lot of words, just looked at where I could change words without changing meaning, like using 'widget making' instead of 'widgeting' and 'widgetwork' instead of 'widget'.

It looks to me like Google is finding pages with more depth on the topic this way. It will be interesting to see if my new results last. I'm curious to see if this works for other people as well.

annej

3:04 am on Feb 15, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Forget the above. I see that Google is dancing so the results are dancing all over the first page.

humpingdan

11:28 am on Feb 17, 2004 (gmt 0)

10+ Year Member



[scientificamerican.com...]

not latent, but a little more on semantics helped me out!

incywincy

12:46 pm on Feb 17, 2004 (gmt 0)

10+ Year Member



I don't mean to be dumb here, but I thought that since the inception of SEO it has been good practice to sprinkle keywords and related terms across your optimized pages; that is, don't rely on a single keyword combination.

If you've done this, won't you rank well regardless of LSI/non-LSI ranking algorithms?

Scarecrow

1:12 pm on Feb 17, 2004 (gmt 0)

10+ Year Member



The scholars who write these pristine little papers on techniques such as MDS (multi-dimensional scaling) use pristine little data sets to illustrate their points. Such is the case with the paper cited at the beginning of this thread, which uses MDS to illustrate its technique.

In fact, they fail to mention that the larger the data set, the more imperfect and unpredictable the results. Consider trying to plot the relationships between 100 entities on a computer screen. You have 4,950 unique pairs with 100 entities. You need n-1 or 99 dimensions to plot it perfectly.

Don't believe me? Plot three points on two dimensions, where all three points are equidistant. No problem -- you have three points of an equilateral triangle. Now add a fourth point, such that all four points are equidistant. You cannot do it on two dimensions. The fourth point has to be placed behind or in front of the screen. You now need three dimensions, or n-1.
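The n-1 claim is easy to check numerically. Four mutually equidistant points fit in three dimensions but not in two; a short sketch, assuming NumPy is available:

```python
import itertools
import numpy as np

# Four mutually equidistant points: take alternate vertices of a cube,
# which form a regular tetrahedron.
pts = np.array([
    [1.0, 1.0, 1.0],
    [1.0, -1.0, -1.0],
    [-1.0, 1.0, -1.0],
    [-1.0, -1.0, 1.0],
])

dists = [np.linalg.norm(a - b) for a, b in itertools.combinations(pts, 2)]
# All six pairwise distances come out equal, but only because a third
# dimension was available; no placement in the plane achieves this.
```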

When was the last time you were able to visualize even four dimensions? Our brains don't do very well in this area. You can set up a matrix chart with numbers and show all the data, but trying to reduce it means that you have to start cutting lots of corners.

MDS is frequently illustrated by scholars by taking a road map chart that shows the driving distance between cities. They take these numbers and do an MDS plot. The map comes out pretty good, although it might be a mirror-image or upside down.

What they don't tell you is that the reason it comes out okay is that the map was a mere two-dimensional situation to begin with. It's a stacked deck! In the real world, MDS will almost always be trying to plot more than two dimensions. The whole thing goes downhill very rapidly with MDS, at the same time as the crunching required goes up exponentially.
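The road-map illustration can be reproduced with classical MDS (double centering plus an eigendecomposition). The city layout below is invented, and the point is exactly the one above: the recovery is perfect here only because the data set was two-dimensional to begin with:

```python
import numpy as np

# Hypothetical straight-line distances between four "cities" laid out
# on a plane: a genuinely two-dimensional data set.
cities = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0], [3.0, 4.0]])
D = np.linalg.norm(cities[:, None] - cities[None, :], axis=-1)

# Classical MDS: double-center the squared distances to get a Gram
# matrix, then keep the top two eigenvectors as coordinates.
n = len(D)
J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
B = -0.5 * J @ (D ** 2) @ J                # Gram matrix of the layout
vals, vecs = np.linalg.eigh(B)
idx = np.argsort(vals)[::-1][:2]           # two largest eigenvalues
X = vecs[:, idx] * np.sqrt(vals[idx])      # recovered 2-D coordinates

# The recovered map reproduces every distance, though possibly rotated
# or mirror-imaged, just as described above.
D_hat = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
```

With truly high-dimensional data, the discarded eigenvalues are no longer near zero, and the reconstruction error grows accordingly.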

I use MDS as an example, because I have extensive experience with it from plotting 100 points on a screen. But it's the same thing with all of these fancy techniques. You have a very complex, multi-dimensional problem, and you have to reduce it to the top ranked results. Whether it's 10, or 20, or 100 in the top rank, I can assure you that the end product will leave you with that "filter feeling" we got from Florida and Austin.

Since the end product is so unsatisfactory in any case, and since the data sets involved are so vast, it makes sense to revert to something simpler, as opposed to throwing more sophisticated algorithms at it. I think Google will figure this out eventually. Good old word proximity, when your search terms involve more than one word, is pretty simple, has low overhead, and is probably more useful than all these fancy algorithms put together. I'm not saying that word proximity alone is sufficient, but I use it as an example because I believe that Google is using it less now than it used to.

If Google is using lots of fancy stuff, then Google has gone too far. You'll burn your brains out trying to optimize for it.

Hissingsid

1:50 pm on Feb 17, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



use pristine little data sets to illustrate their points. Such is the case with the paper cited at the beginning of this thread, which uses MDS to illustrate their technique.

Hi,

That's the point. I'm almost certain that Google is doing this in a stepped process.

Step 1. Select relatively small sample using algo 1. (something like the old Google algo for example)

Step 2. Run CIRCA indexing on it.

Step 3. Combine the results from step 1 and 2 and present in users browser.

The neat trick is selecting a smaller sample to work with. That's why I think we always end up with a maximum of 1,000 results in the SERPs. It's a kind of blinding flash of the bleeding obvious.
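Roughly, in Python. This is purely illustrative: CIRCA is proprietary, so a plain cosine-similarity re-rank over invented documents stands in for step 2:

```python
import numpy as np

docs = [
    "widget repair widget guide",
    "widget news daily widget",
    "gadget and widget comparison guide",
    "cheap hotel deals",
]
vocab = sorted({w for d in docs for w in d.split()})

def vec(text):
    words = text.split()
    return np.array([words.count(t) for t in vocab], dtype=float)

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Step 1: a cheap keyword filter selects the small sample.
query = "widget"
sample = [i for i, d in enumerate(docs) if query in d.split()]

# Step 2: run the expensive "semantic" analysis on that sample only
# (here, similarity to the sample's centroid stands in for CIRCA/LSI).
centroid = np.mean([vec(docs[i]) for i in sample], axis=0)

# Step 3: combine, i.e. present the sample re-ordered by the step-2 score.
reranked = sorted(sample, key=lambda i: -cos(vec(docs[i]), centroid))
```

Only the pages that survive step 1 ever see the expensive analysis, which is the whole point of capping the sample.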

Best wishes

Sid

barbeta

1:54 pm on Feb 17, 2004 (gmt 0)

10+ Year Member



Scarecrow... you can use a multidimensional matrix to solve the problem, and then use vectors on it. That solves the problem and, of course, gives you the most accurate results on huge data sets.
(Obviously, you can't see it on screen.)

g1smd

8:37 pm on Feb 17, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



>> You'll burn your brains out trying to optimize for it. <<

Great, maybe you can all go back to writing great content for your visitors, and forget all about keyword density, pagerank, and all those other distractions.

258cib

9:48 pm on Feb 17, 2004 (gmt 0)

10+ Year Member



Scarecrow is wise.

Yeah, great content is, indeed, important. But, the only way you're going to know what G or other search engines think is great content is by looking at what they select. And, you might not agree with it.

Or, in some cases they'll use x and others y.

We really need some help, for example, with city, state searches. But who is to say what is great content in this kind of search? Hotels often come up. Well, maybe that's what most are looking for? I don't know. Florists are big in many city, state searches. Weird? No, not really, in the overall scope of things, if you have to pick just ten things.

When you only have a couple of words, there is all kinds of room for misunderstandings. So, categorizations will eventually come into play, I predict.

SyntheticUpper

10:00 pm on Feb 17, 2004 (gmt 0)

10+ Year Member



A fascinating post by Scarecrow. One of the relevant papers uses the simple 3D analogy of 'Breakfast' in the XYZ plane: Bacon, Egg, Coffee.

And of course the maths can be extended into as many dimensions as required. But with dozens of potentially relevant words on each page, it always struck me that it was unlikely any search engine would have the processing power to analyse the billions of pages out there.

Google must be using shortcuts to do this. The subjective success or failure of the algo must depend upon the mix of shortcuts (word count/proximity etc.) to this more sophisticated stuff.

Perhaps they got this mix wrong with Florida and Austin. A reasonable précis, Scarecrow?

This 150 message thread spans 5 pages.