
Latent Semantic Indexing

         

marin

12:16 am on Jan 11, 2004 (gmt 0)

10+ Year Member



Please don't kill me for reminding you of the semantic approach, but I definitely believe we must consider the LATENT SEMANTIC INDEXING [webmasterwoman.com]
theory; it ties together things like semantics and stemming, and explains why our singular keywords are not affected.

Note: the original URL is no longer online, but I've
edited the link to point to a reprint of the original

[edited by: tedster at 7:11 am (utc) on May 10, 2007]

glengara

10:03 pm on Feb 11, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Must say I find the possibility of using Semantics to grasp the meaning of a page a hugely exciting area.
IMO we're looking at the basis of the future on-page relevance ranking method here, and largely immune to our traditional on-page "optimization" ;-)

skippy

10:20 pm on Feb 11, 2004 (gmt 0)

10+ Year Member



Ok, this makes sense. It would explain why portals and directories are ranking so high. Also, one of the keywords I follow is a term mostly restricted to the United States, with light usage in the UK.

However, a synonym is heavily used in the UK and lightly used in the US. Right now the SERPs are flooded with UK results. So the question is: are some synonyms more powerful than others?

Hissingsid

10:20 pm on Feb 11, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If the keyphrase is weeded out of the analysis on the grounds that all pages in the set contain it, then no weight can be given to having the keyphrase in the title.

I'm actually suggesting that CIRCA is an extension of what is described in the LSI paper in that it may exclude the actual term searched for, if that term is one which Google has earmarked for special consideration.

Pages that are an authority on the broad topic would then be given a boost because they are then only judged on words that are a close semantic match.

I'm sure that it's not this simple and some of this may well be very wide of the mark, but if we assume that it is right there is a clear way forward. If we are wrong, then it will not harm you to broaden your language using a well-researched controlled vocabulary of words that are close semantic matches for the term you are aiming for. It's a win-win situation.

Best wishes

Sid

PS I think that we owe Marin a big vote of thanks here; I know that he started this thread after considerable research and a wish to share his findings with us all.

PPS The thing that makes me think the term is thrown out is twofold. 1. It makes logical sense, since we can assume that every page found using the old algo either has the term somewhere in the text or in links pointing to the page (perhaps when they have this working right it will stop Google bombing as a result). 2. The +www or +a search that brings sites that used to be #1 back to the top!

Hissingsid

10:26 pm on Feb 11, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member


[quote]However, a synonym is heavily used in the UK and lightly used in the US. Right now the SERPs are flooded with UK results. So the question is: are some synonyms more powerful than others?[/quote]

Semantics isn't just about synonyms; it's about a lot more than that. Google (Applied Semantics) has a thing called an ontology. This is a kind of sophisticated computer-based dictionary that describes the weight of links between words. Not just synonyms but other words too. I'm sorry if I misled you with the synonym search thing. I do think that this is an important tool, but there are other ways of finding what Google thinks are closely related words. Just look at the top three sites in your SERPs with a very critical semantic eye. Work through the discard process in the LSI paper.

Best wishes

Sid

I'm off to open a bottle of wine.
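For anyone who hasn't waded through the LSI paper: its core trick, stripped to a toy example, is a truncated SVD of a term-document matrix. Terms that occur in the same documents end up close together in the reduced "latent" space even if they never appear side by side. Here is a minimal sketch in Python with numpy; the terms, the counts, and the choice of k=2 dimensions are all invented for illustration, and this is of course not Google's implementation:

```python
import numpy as np

# Toy term-document count matrix: rows are terms, columns are documents.
# Documents 1-2 are about cars, documents 3-4 are about baking.
terms = ["car", "auto", "engine", "recipe", "flour"]
counts = np.array([
    [2, 1, 0, 0],   # car
    [1, 2, 0, 0],   # auto
    [1, 1, 0, 0],   # engine
    [0, 0, 2, 1],   # recipe
    [0, 0, 1, 2],   # flour
], dtype=float)

# LSI: factor A = U S Vt and keep only the top k singular dimensions.
U, S, Vt = np.linalg.svd(counts, full_matrices=False)
k = 2
term_vecs = U[:, :k] * S[:k]          # term coordinates in the latent space

def cosine(a, b):
    """Cosine similarity between two latent-space vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

idx = {t: n for n, t in enumerate(terms)}
print(cosine(term_vecs[idx["car"]], term_vecs[idx["auto"]]))    # close to 1
print(cosine(term_vecs[idx["car"]], term_vecs[idx["recipe"]]))  # close to 0
```

At web scale this matrix would have millions of rows and columns, which is exactly why the SVD is so expensive in processing terms.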

valeyard

10:32 pm on Feb 11, 2004 (gmt 0)

10+ Year Member



Sid,

I'm sure that it's not this simple and some of this may well be very wide of the mark, but if we assume that it is right there is a clear way forward. If we are wrong, then it will not harm you to broaden your language using a well-researched controlled vocabulary of words that are close semantic matches for the term you are aiming for. It's a win-win situation.

Win-win for me the "SEO" perhaps. Not win-win for me the searcher who can't find what he wants because Google seems to be ignoring TITLE tags and presenting me with irrelevant pages and directories.

This new algo combination seems to be the worst of both worlds. It provides the user with inferior results yet we webmasters will still be able to manipulate it. Security through obscurity never works.

PS I think that we owe Marin a big vote of thanks here; I know that he started this thread after considerable research and a wish to share his findings with us all.

I second that. Thanks, Marin!

skippy

10:36 pm on Feb 11, 2004 (gmt 0)

10+ Year Member



I don't know about misled, but I understand that I cannot just stuff synonyms on my page. I do agree that this is a very exciting development, and interesting too.

And I will third the Thanks

flicker

11:23 pm on Feb 11, 2004 (gmt 0)

10+ Year Member



What's wrong with directory pages, anyway? I'm much happier to see directory pages in my results than spam. In fact, sometimes the directory pages are the most useful results for me. Not so much the ODP and Yahoo directories, seeing as how I know where both of them are already; but the niche directories, specialist directories, and directories with reviews on them are often invaluable when I'm searching for something in-depth. I welcome them in my SERPs.

Maybe it would be nice if there were a search engine that gave you little boxes to check in the advanced search, such as "exclude directory pages," "exclude shopping sites," and "exclude affiliate sites" as well as the usual "exclude porn." It sounds like a lot of people on here would like to be able to search without the kinds of results they favor less getting in their way. (-:

metrostang

3:21 am on Feb 12, 2004 (gmt 0)

10+ Year Member



What I'm seeing in searches is starting to make sense if they are in fact using this. For 18 out of 20 two-word combinations using ~keyword1 keyword2 -keyword1 keyword2, results do indeed show other related words that would be relevant to similar pages. These are all in one industry, with one of the keywords the same.

The results are all what you expect to see with no off topic pages except for two of the combinations. They come back with no results at all.

These are the same two that return horrible results if you search normally. Eight of ten on a page are bogus. Could it be that this is a work in progress, and those two sets of keywords are not in the system yet, so the returns are based strictly on the keyword being present?

This would partially explain some of the search results we are seeing.

TheWhippinpost

3:39 am on Feb 12, 2004 (gmt 0)

10+ Year Member



Nice find, marin. Wish I'd read that a few days ago; this thread's [webmasterworld.com] findings about "related" words now make a lot more sense.

Really interesting, cheers... am off to read hissingsid's link.

BTW... links and PR could still play a part by helping G, for instance, to determine an "authority" document to (cross) reference from?

Kirby

3:51 am on Feb 12, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>BTW... links and PR could still play a part by helping G, for instance, to determine an "authority" document to (cross) reference from?

Absolutely. This returns the original purpose of PR, links and anchor text to the algo. The proof will be the success or failure of Google bombing.

steveb

5:12 am on Feb 12, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



There are a bunch of these "~" searches that are very interesting.

~keyword1 keyword1
favors pages where the related words appear immediately before the keyword in the title

~keyword1 keyword1 is very different than keyword1 ~keyword1

~keyword1 ~keyword1 is similar but not the same as keyword1 keyword1

Except for a straight ~keyword1 search, all these doubled up and "~" searches bring up remarkably good results... particularly finding a lot of content-rich sites that are mediocrely (but not poorly) seo'ed.

Unfortunately, I haven't been able to figure out how to do multiple words.

Also interesting, a word that might show up as a related word -- that is, keyword2 is highlighted for a ~keyword1 search -- may NOT be seen as related in reverse... meaning for a ~keyword2 search keyword1 is NOT highlighted.
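steveb's observation that relatedness can run one way only is plausible if the association measure is directional. A hypothetical sketch (the documents, the vocabulary, and the choice of conditional co-occurrence as the measure are all invented here, purely to show the asymmetry): the fraction of "jaguar" documents that also mention "car" need not equal the fraction of "car" documents that mention "jaguar".

```python
from collections import Counter

# Tiny toy corpus: "jaguar" is ambiguous, "car" is not.
docs = [
    "jaguar car speed engine",
    "jaguar animal jungle",
    "car engine speed road",
    "car road traffic",
]

term_docs = Counter()   # in how many documents each term appears
pair_docs = Counter()   # in how many documents each ordered pair co-occurs
for d in docs:
    words = set(d.split())
    for w in words:
        term_docs[w] += 1
    for a in words:
        for b in words:
            if a != b:
                pair_docs[(a, b)] += 1

def assoc(a, b):
    """P(b appears | a appears): fraction of a's documents that also contain b."""
    return pair_docs[(a, b)] / term_docs[a]

print(assoc("jaguar", "car"))   # 0.5  : half of "jaguar" docs mention "car"
print(assoc("car", "jaguar"))   # 0.33 : only a third of "car" docs mention "jaguar"
```

Any engine scoring relatedness this way would naturally highlight keyword2 for ~keyword1 without doing the reverse.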

McMohan

7:23 am on Feb 12, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I wonder why one of my posts on Jan 10th did not kick off the discussion in this direction. Ignorantly, I had touched upon this subject, where I had suggested the Google KW suggestion tool as a way to see what Google considers similar words.

Here is the post -
[webmasterworld.com ]

Mc

steveb

8:26 am on Feb 12, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Another interesting thing, when searching for
keyword ~keyword
the ransom note highlights the key text in
keywordother.com

In other words, it recognizes the keyword within a wordstogether.com domain name.

This may be less interesting, though, in that this keywordother.com is based on an extremely common word within the niche, much more common than, for example, webmasterworld.com... so they are recognizing the "webmaster" part of webmasterworld.com, but they may not recognize webmaster in webmasterother.com.

Hissingsid

8:58 am on Feb 12, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I wonder why one of my posts on Jan 10th did not kick off the discussion in this direction. Ignorantly, I had touched upon this subject, where I had suggested the Google KW suggestion tool as a way to see what Google considers similar words.

Hi,

What excited me about Marin's discovery is the fact that it provides a real explanation of the mechanism. Then if you make a few assumptions all of the problems that we have been seeing start to make sense.

Soon after Florida, well by mid-December anyway, I think we had narrowed it down to being something to do with CIRCA. I went off on a tangent using the AdSense preview tool to check my page themes and then tweaking them until the feedback I got from the ads shown matched my target theme. I think that this has helped, but having played with this fairly extensively I think I can confirm that the AdSense "sensing technology" is related to, but not the same as, Google search as we are seeing it now. I suspect that the difference is that the search term is actually left in for AdSense, but I could be wrong. Some folks on the AdSense forum have noted that a single occurrence of a term gets ads on that term even though there are many occurrences of a more relevant target term; this is a good indicator that LSI may be being used in some form.

I think that the Google suggestion tool is just one way of gathering related terms. There are lots of other tools out there to give you indicators of linked words, and careful dissection of the pages at the top of your SERPs is a key way of finding related terms.

Although I am certain that this is the answer, I'm still having problems successfully implementing it on an existing site.

Best wishes

Sid

Bobby

9:25 am on Feb 12, 2004 (gmt 0)

10+ Year Member



Although I am certain that this is the answer, I'm still having problems successfully implementing it on an existing site.

Sid, I think it all boils down to just this.
It would be useful to get feedback from other webmasters' experiences with subtle linguistic changes they have implemented and how it has affected the SERPs.

marin

9:56 am on Feb 12, 2004 (gmt 0)

10+ Year Member



Hi Bobby,

It works for me; my index is very stable - top 10 - after Austin for my main keyword1 keyword2.
IMO, finding the really related subjects and building content for them is the way to become an authority in gg.
Maybe this thread should be merged with this one:

[webmasterworld.com ]

tantalus

10:27 am on Feb 12, 2004 (gmt 0)

10+ Year Member



"...it recognizes the keyword within a wordstogether.com domain name."

That's an interesting point, SteveB.

When I searched for Keywordword (which is a word in itself) using the ~ - formula, no results were found, although a normal search would return 7 million odd. I found that surprising.

There does seem to be something happening with strings although I can't put my finger on it.

Perhaps the art of keyword embedding will be back.

webdude

10:48 am on Feb 12, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Bobby,

These concepts seem to be working for me.

<massive edit by owner>

jkuest

2:20 pm on Feb 12, 2004 (gmt 0)

10+ Year Member



In Germany we have only been hit by Austin. I can confirm those theories. I used the Google AdWords suggestion tool, which gave me exactly the keywords my site was optimised for... and those pages were the first to go. Since the number of (German) keywords is still limited, it is easy to spot. I assume this keyword data is from AdWords. The semantic pages survived the blow or came up. Thanks for this enlightening thread; it gives us direction.

zgb999

2:26 pm on Feb 12, 2004 (gmt 0)

10+ Year Member



For me LSI/CIRCA still leave a lot of open questions:

- will this lead to more content being "stolen"? If LSI prefers the word combinations it finds on the top-ranking page, I could just copy that page (maybe altering the company name) and this might bring me to the top.

- what is the impact on languages like Japanese? To my understanding there is no such thing as pronoun... there. From those languages we might learn more details about the algo. (See also
[webmasterworld.com...] )

- is LSI mainly used to rank the pages, or to decide which backlinks are to be given what value? If it really is used for the ranking of pages directly, then why is Google bombing still working ("miserable failure" is still up)? Would Bush's biography disappear from the SERPs if they deleted the sentence "We will not tire, we will not falter, and we will not fail."? (Note the last word...)

flicker

2:59 pm on Feb 12, 2004 (gmt 0)

10+ Year Member



I think the "miserable failure" thing is a red herring... people will probably always be able to Google-bomb (or any search-engine-bomb: Altavista and MSN return Bush's biography #2 and #1 also) for a phrase like "miserable failure" that has zero websites already devoted to it. I could probably Google-bomb Bush's site for "chartreuse platypuses" with ten links tomorrow, because that phrase doesn't even exist anywhere else on the web right now.

Anchor text *should* have *some* weight. As long as I can't make Bush's site come up #1 for a search on "John Kerry" or "Texas real estate," then Google is doing its job. IMHO.

Hissingsid

3:18 pm on Feb 12, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Anchor text *should* have *some* weight. As long as I can't make Bush's site come up #1 for a search on "John Kerry" or "Texas real estate," then Google is doing its job. IMHO.

Absolutely spot on.

You can easily Googlebomb a term that is rarely used, because the main factor by far becomes the anchor text; there is little else competing with it. I still think that in order to be analysed by CIRCA the term used has to be in Google's ontology, and in order to get into the ontology the term has to be worth something, either intellectually or commercially.

It's simple really: if the term isn't in the ontology, how can the LSI algo know what is a related word? If it doesn't know what is related to it, how does it know what to look for?

It's just too wasteful to develop semantic maps for search terms that are never used. The 80/20 rule suggests that Google would be (and is) concentrating on the most frequently used terms. The guy who wrote the paper on LSI that Marin quoted at the start of this thread told me that he doubts Google is using LSI, because it is very hungry in processing requirements. He was of course thinking about a sample of 3.3 billion pages, but if you reduce this to a sample of, say, 5,000 found using the old Google algo and only sampled the terms most often used, then it is easy to see how it could be, and is, being used. That is what I think Google bought with Applied Semantics Inc: the trick of doing the analysis on small samples.

Best wishes

Sid
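Sid's two-stage hypothesis (old-algo keyword filter first, then semantic scoring with the search term itself discarded) can be sketched in miniature. Everything below is invented for illustration: the RELATED weights stand in for the ontology, the documents are toy strings, and this is speculation about the mechanism from this thread, not Google's actual pipeline.

```python
# A toy stand-in for the "ontology": weighted links from a term to
# semantically related words (all weights made up).
RELATED = {
    "widget": {"gadget": 0.9, "gizmo": 0.7, "device": 0.5},
}

docs = {
    "stuffed": "widget widget widget widget buy cheap widget",
    "topical": "widget gadget gizmo device reviews of every gadget",
}

def stage1(query, corpus):
    """Old-algo step: keep only documents containing the query term."""
    return {d: text for d, text in corpus.items() if query in text.split()}

def stage2_score(query, text):
    """Semantic step: score related words only; the query term is discarded."""
    weights = RELATED.get(query, {})
    return sum(weights.get(w, 0.0) for w in text.split() if w != query)

candidates = stage1("widget", docs)
ranked = sorted(candidates,
                key=lambda d: stage2_score("widget", candidates[d]),
                reverse=True)
print(ranked)
```

Under this sketch the keyword-stuffed page scores zero in stage two, because once "widget" itself is thrown out it has no related vocabulary left, while the topical page rises on its broad language.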

zgb999

3:59 pm on Feb 12, 2004 (gmt 0)

10+ Year Member



"Miserable failure" brings up 251,000 pages, so it is not some meaningless phrase that no one has on their pages.

IF LSI is used in the way Sid describes, then Google bombing would be very difficult.

In the first step (pre-Florida ranking) it might be number 1. But in the second step (LSI) it would be buried and wouldn't be near the top.

Both words (miserable and failure) bring up similar words in AdWords Keyword Suggestions. So if you think that AdWords Keyword Suggestions match Google's ontology, then LSI would be applied to that search term.

CIRCA is certainly much more than LSI and they proved that it worked at Applied Semantics.

By the way, I do think we are on to something. GoogleGuy has answered many questions since this thread went up. More than in previous weeks... Coincidence? Maybe, but maybe he was around, seeing where this thread was going...

Hissingsid

4:37 pm on Feb 12, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Miserable failure brings up 251,000 pages so it is not some meaningless words that no one has on their pages.

If you look at the SERPs, it's hard to find a page that isn't referring to Googlebombing, except of course the first couple ;)

Which came first the chicken or the egg?

The focus of CIRCA is the ontology that they have developed. Bearing in mind where Applied Semantics Inc grew from, it would not be beyond the bounds of safe assumption to think that their starting point in developing the ontology was commercial and frequently used search terms.

There are many terms that are widely used on web pages but will only be searched for once in a blue moon. I can pick two words from my own site, find over a million results, and feel very smug that I'm #1, but no one will ever search for that term, so my top ranking is worthless.

Before the Google bombing story, who would have searched for "miserable failure" other than some poor b**tard looking for another one. :-)

Best wishes

Sid

zgb999

5:51 pm on Feb 12, 2004 (gmt 0)

10+ Year Member



My point was that "miserable failure" would be an example of a SERP that should be influenced by LSI (as the words are most probably in Google's ontology).

So it doesn't matter how many pages there were before. If LSI were applied as a second step, no page with the content of President Bush's biography should come up in the top 10.

If Google uses a different ontology for the SERPs than for AdWords, then I could agree with you.

Hissingsid

6:26 pm on Feb 12, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



So it doesn't matter how many pages there were before. If LSI were applied as a second step, no page with the content of President Bush's biography should come up in the top 10.

I understand your logic but...

It depends on how much LSI is used to influence the final SERPs. It could be argued that for a non-competitive term under Google's old algo, anchor text was by far the most important element, so Bush's bio could be way in the lead over other pages. Then if you add in the LSI element, perhaps other pages can't catch up even with the current level of boost. I see Bush's bio is no longer #1. Also, you can't rule out the fact that words used in the bio, like "war" and "terrorism", might be close semantic matches for failure.

Also remember that there was a step change in January, when a far wider range of terms was affected. Since this hypothesis, that Google is using CIRCA as part of the algo, relies on the selection of a sample first using the old algo, it is very easy to see how only selected terms would be used. In fact the logic in this adds weight to the suggestion that it is indeed a two-stage process.

"Miserable failure" is a red herring. Virtually every hypothesis regarding Florida and Austin (OOP, filter etc.) is answered by CIRCA applied to a sample. You can't disprove the hypothesis using the Bush Google bomb as an example, because there is almost certainly a compiled list of terms to which the new algo is applied, and if I were Google I would make sure that "miserable failure" and "search engine optimization" were not on that list.

Best wishes

Sid

badtzmaru

6:48 pm on Feb 12, 2004 (gmt 0)

10+ Year Member



The tilde search thing is HOT. I just did it for a few of my keywords and it seems to give significant insight into why the top sites for my keyphrases are there...

rharri

7:57 pm on Feb 12, 2004 (gmt 0)

10+ Year Member



This looks like another technique of "text mining." There are strong groups at Stanford and an offshoot at Berkeley (Marti Hearst comes to mind).

BTW, I may have missed it in this paper, but words can also be defined by their occurrence next to other words. So a word not in Google's ontology can still be defined if it occurs close to another word that is.

Bob

[edited by: rharri at 9:49 pm (utc) on Feb. 12, 2004]

Chndru

8:05 pm on Feb 12, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



More evidence that if the old game was hitting targets, the new one is overcoming hurdles....

well said, caveman. This captures the slight shift in the way G thinks. Laser-sharp accuracy traded for better reliability.

This thread is developing into a very interesting one.

hanan_cohen

9:26 pm on Feb 12, 2004 (gmt 0)

10+ Year Member



I want to offer some supporting evidence from my corner of the Mediterranean.

I have read the threads here about Austin and didn't understand what you were talking about, because my sites were not affected at all. And why? Because they are in Hebrew. It seems Google hasn't been taught Hebrew yet, so this semantics thing does not affect us.
