Latent Semantic Indexing

Forum Moderators: open

Message Too Old, No Replies

Latent Semantic Indexing

marin

12:16 am on Jan 11, 2004 (gmt 0)

Please do not kill me because I remember you the semantic approach, but I definitely believe we must consider the LATENT SEMANTIC INDEXING [webmasterwoman.com]
theory; this put together things like semantic and stemming, and explains way our singular keywords are not affected

Note: the original url is no longer online, but I've
edited the link to point it to a reprint of the original

[edited by: tedster at 7:11 am (utc) on May 10, 2007]

Hissingsid

9:13 pm on Feb 18, 2004 (gmt 0)

Metrostang,

I beleive that you have it exactly right.

After Florida I started to look at the pages at the top of SERPs for my target terms and they had inadvertently used synonyms and closely related terms in anchor text predominantly.

The CIRCA paper talks about separating the term into tokens (another word for, well word) and analysing those separately. Somehow they have to teach it about terms that mean more as a term than the sum of the two parts. In my case widget finance has only one meaning and it would be better to look for stems and associated terms to that. Ask any person in the street in the UK what widget finance was and they would look at you like you were some kind of moron.

Perhaps thats what they have done at Brandy. ie taught the thing what some simple two word terms mean as a package

Best wishes

Sid

SlyOldDog

9:16 pm on Feb 18, 2004 (gmt 0)

>>Jo Mo from New York told Sarah Widgets that ...". Such a page is semantically distant from the real topic and should have been penalised by LSI.

I missed this whole thread, but I just read the excellent site posted at the start by Bobby.

I noticed that the first thing you do with LSI is discard "padding words" which do not contribute to the document's meaning. Perhaps that explains the problems of dropped sites in Florida with keywords like:

Real Estate
New York

"Real" and "new" are common words which at first glance add no value. Perhaps they were considered padding words in the Florida update and when Google saw webmasters screaming they realized these words should not have been removed.

That would explain the miraculous return of SEO sites in December and recently. As Google tunes which padding words to exclude, rankings will change significantly if those words apply to your site.

valeyard

9:25 pm on Feb 18, 2004 (gmt 0)

One thing that a lot of people noticed after Florida & Austin was that you would search for "new york widgets" and up would come a page on an authority website that mentioned, "Jo Mo from New York told Sarah Widgets that ...". Such a page is semantically distant from the real topic and should have been penalised by LSI.

I think these sort of errors are explained by Google implementing LSI - badly.

I've already made the point about preselection. To avoid applying LSI to the entire dataset (as I believe it was intended to be used) we suspect Google is first pulling out a tiny percentage of pages. How does it select them? Dunno, but being any sort of authority - or simply having good PageRank - probably helps.

So you're applying LSI to a sample already biased and massively incomplete. A page with some random semantic connection to the search term gets ranking it never deserves by virtue of being in the preselection. Had LSI been applied to the entire dataset the dud page would probably have been driven out by pages with far more semantic relevance - but which didn't make the first cut.

Result: SERPS suck.

Semantic indexing can be great if done right. I think the Austin fiasco proves you can't do it "on the cheap".

Or maybe there's no semantic indexing at all going on. Maybe Google really have implemented an OOP and a commercial dictionary to force people into adwords. I prefer the semantic theory if only because, as Sid says, if it's right then at least there's something I can do!

SlyOldDog

9:38 pm on Feb 18, 2004 (gmt 0)

Google would have to keep some off page ranking element in the algorithm or it would be a simple matter of copying the text word for word from the #1 site, and in the next update you would have either the top or second spot.

Perhaps it biases the results and reduces quality, but off page factors reduce the ease of manipulating the results.

caveman

10:15 pm on Feb 18, 2004 (gmt 0)

Sid, I was just skimming back through this thread and saw that you asked a question I never answered. Sorry, missed it.

Regarding the vanishing homepages, without getting into great detail, if you relate your general conclusions to the way most homepages of SEO'd sites are constructed, it's no wonder most of them were blown up, and it makes sense that a lot of small innocent sites would see their homepages hit also.

Meanwhile, directories seemed to flourish. :-)

Nice work on this btw.

Munster

10:38 pm on Feb 18, 2004 (gmt 0)

This is great, how do we optimise for it without making our content suffer?

idoc

11:27 pm on Feb 18, 2004 (gmt 0)

"It appears the new alog is treating each keyword as a single search and then combining the results"

Interesting, I thought about this also... in trying to extract something useful from the new google page style it looked alot like a search for KW1 and then search within results for KW2 returns results alot like an original search for KW1 KW2. Also intersting a search for "KW1 KW2" quoted returns results totally different which also made me wonder whether there are result sets for each keyword... semantic filters applied that are sort of modularly combined for the word pairs. That might explain also the odd results sometimes.

TheWhippinpost

12:34 am on Feb 19, 2004 (gmt 0)

landmark:

search for "new york widgets" and up would come a page on an authority website that mentioned, "Jo Mo from New York told Sarah Widgets that ...". Such a page is semantically distant from the real topic and should have been penalised by LSI.

After conductin a "genuine" search for non-work-related info yesterday, I have to say that I have symapathy for what you're sayin landmark. The results returned were condusive to an LSI-type algo and so therefore, I only had probably around 3 docs that were "on-the-nose" for my query... the rest (on page 1) were distantly related but not absolutely relevant.

We must remember in all this, not to pin too much reliance on believing this current algo is LSI - it's highly probable it isn't - it's just the best explanation of describing what's philosophically happening... whatever variant of LSI is bein used, we musn't also forget it's being used in conjunction with G's other tools/measurements which will inevitably mean different results from a purely LSI-filtered one.

Beachboy

7:30 am on Feb 19, 2004 (gmt 0)

Good call, Webdude.

<<Stuffing KWs is pretty much dead. I would pick several good phrases and try to write naturally to portray what you are trying to say by alternating the phrases with synonyms.>>

I took some time today to sort thru several pages of SERPs in categories I work with that often see very spammy, keyword-stuffed text, nonsense phrases and so on, ranked rather high up. Pages using those techniques are gone, or at least pushed way down in the results.

Looks to me like Google is sorting out "unnatural" language.

Hissingsid

9:23 am on Feb 19, 2004 (gmt 0)

This is great, how do we optimise for it without making our content suffer?

It actually makes content better from a human stand point. Instead of repeating the target keyphrase at the begginning and end of every sentence verbatum you use alternative, perhaps more descriptive phrases for exactly the same thing.

Two ways to find associated terms that add context to your pages are: 1. the ~widget -widget search and 2. Search for your target term and see what meaningful words the top three sites use and where they use them. You're not aiming for semantic perfection just to be semantically one goal better than the other guy.

Best wishes

Sid

<added> TheWhippinpost, I think you have it spot on. Understanding LSI helps us to get our heads round what they are doing at Googleplex now. Its kind of the nearest approximation that we are going to find.

Beachboy, I think that's a good observation of what this actually means.

landmark

9:49 am on Feb 19, 2004 (gmt 0)

So, to optimise a page, if I understand correctly, we are not simply trying to get as many synonyms as possible on the page. What we are trying to do is to make our page semantically as close as possible to the "average" page from G's sample pages, i.e. we want our page to appear in the centre of the cluster of pages on the semantic map.

So, for example, if the top ranking pages mention a particular synonym 3 times, and our page currently mentions it 20 times, we need to reduce the occurence of that synonym on our page.

Right?

Hissingsid

11:10 am on Feb 19, 2004 (gmt 0)

So, for example, if the top ranking pages mention a particular synonym 3 times, and our page currently mentions it 20 times, we need to reduce the occurence of that synonym on our page.

I think that you are going to need to change your approach. Its not just about synonyms its about relationships between words that set the context of the page.

The CIRCA semantics white paper lists the following types of word associations:

Synonymy/antonymy, Similarity, Hypernymy, Membership, Metonymy, Substance, Product, Attribute, Causation, Entailment, Lateral bonds.

It gives as an example a dog is a pet but pet is not a synonym of dog however it does put the word dog in context.

I guess that it is relatively easy to understand what a page is about, what its context is, but it is very hard to understand what a two word search term is about. I think that one of the flaws of the new Google system is the fact that people have learned how to make search engines give them what they want. Through a process of trial and error they find word combinations that bring them the right results. This new Google approach tries to guess what they really mean and since the user has contrived a search in a way that they beleive that they can predict the kind of results they might get then the two sides of this are working against each other.

Until users start entering their search term with a context over rider it will always be far from perfect.

I think that I prefer the refine option given by some other SEs.

Best wishes

Sid

humpingdan

3:40 pm on Feb 19, 2004 (gmt 0)

Looking through all the previous posts and additional links others have posted brought me to
"ontology" - the specification of a conceptualization -

Ontology is a description of the concepts and relationships that can excist for an agent or community of agents.

[www-ksl.stanford.edu...]
Tom Gruber

3rd paragraph alos quite intresting, (eg, semantics independant of reader and context)

just throwing it out there!

What is Google's Conceptualization and what are the specifications of it? <---think were working on that one

[edited by: humpingdan at 3:46 pm (utc) on Feb. 19, 2004]

Net_Wizard

3:40 pm on Feb 19, 2004 (gmt 0)

Something new guys, at least the first time I saw this happening.

For example if you are searching for information about 'Ink' (Inktomi). Obviously Google would return also anything about ink but what is interesting is that it would also highlight anything that have 'ink' on it even Inktomi.

Point is, it just happen that I'm actually searching for Inktomi thus the return is relevant. However, how about if I'm actually searching for 'Ink'? Wouldn't 'Inktomi' irrelevant to my search then?

Now, consider the impact of this on the serp, I'm sure that there are plenty of words that can be separated into two or more words but when broken up doesn't necessarily relate to each other.

Say for example...hotdog...would this word also come up under 'hot' and 'dog'?

What you think?

SyntheticUpper

3:55 pm on Feb 19, 2004 (gmt 0)

It's interesting that the semantic idea is forced to find word connections using data based on what many people search for / what most people write in webpages.

This raises a few concerns: (forgive the following sweeping statements - they're illustrative): but what most people eat are hamburgers, what most people search for is Britney, what most people buy etc. are...

There's a worry that what might be dished up is what 'most people' expect to be dished up (or most webmasters expect most people to be dished up!)

It's pretty clear, for example, using the tilde technique, that Google regards C*non (the company) as a synonym for camera. Well that's o.k. - it's based on what most people / sites connect with camera - but it's of little use for someone who wants a Hass*lblad, and potentially disastrous for a specialist camera firm that doesn't sell the C*non make.

During the worst excesses of Fl/Austin this is what probably hit the micro sites - you might be the best most specialist widget seller in the World - but without these artificially connected keywords (i.e. stuff you chose not to stock due to your specialism) the site was booted.

Food for thought (but only the well known brands)

<edit: moist people substituted for most people :) >

Hissingsid

4:02 pm on Feb 19, 2004 (gmt 0)

Hi Net_Wizard,

I read somewhere in the DomainPark info that they are able to split up a pushed together domain name into separate tokens. The CIRCA white paper describes something caled a "Tokenizer" which is responsible for splitting raw data into tokens. See Page 11 of the CIRCA Applied Semantics White Paper.

Tokens are what the rest of the World calls words but a word can be made up of more than one token.

Best wishes

Sid

Net_Wizard

4:16 pm on Feb 19, 2004 (gmt 0)

Sid,

That's a great idea about domainpark. However, it raises a new can of worms too, meaning we would be seeing a lot of new domain names string together. BTW, the domain token have been for a while I believe because I have a domain that is just like that word1word2.com. Google have always been able to pull up the site when searching for word1 word2 since I can remember.

But, I only observed this in the domains and not in the actual text within the document.

The tokenizing of words within a document could only lead to irrelevant document coming up in the serp.

Hissingsid

4:29 pm on Feb 19, 2004 (gmt 0)

OK.

I have a domain like this widgetgood.com. I sell a service on it.

When I search for good service my page is #1 out of 1.1 million but I don't have good anywhere on my site other than in the domain name. Other sites link to me separating the words widget good but not including the word service.

This seems to be a winning combination and is quite ammusing because there is a firm called Good Service which ranks #3.

If you can understand what I've said above there is quite a good little SEO formula in it.

Best wishes

Sid

Net_Wizard

7:09 pm on Feb 19, 2004 (gmt 0)

On the same token could just encourage more spamming and open to abuse.

Such as widget, widgetdsfdsf, widgethjhjghj, etc..

All different words even if it doesn't make sense but good for widget. It may even be possible to avoid the OOP penalty(if there is such thing) because you are not really repeating the widget over and over.

I'm kind of tempted here to do some experimentation :)

Anybody would lend me a throw away domain :D

claus

10:59 pm on Feb 19, 2004 (gmt 0)

>> It would explain why portals and directories are ranking so high

Dec 27, 2003: [webmasterworld.com...] (#106)

>> Its just too wasteful to develop semantic maps for search terms that are never used. The 80/20 rule suggests
>> that Google would/is concentrating on the most frequently used terms

As far as why blue widget would trigger and red widget wouldn't--we want to be confident that we're improving a given search before we add in a new feature. [webmasterworld.com...] (#16)

>> The old algorithm is basically intact it seems because searches for non-competitive and otherwise
>> obscure terms ie 'miserable failure' still are returning serps like before.

Dec 4, 2003: [webmasterworld.com...] (#145)

>> Through a process of trial and error they find word combinations that bring them the right results.

Nov 23, 2003: [webmasterworld.com...] (#360)

(i still tend to use the term "broad match" in stead of "semantics" or "ontologies" as the latter are very narrow/specific terms and they might not be the only spices in the soup)

>> It may even be possible to avoid the OOP penalty(if there is such thing)

Dec 3, 2003: [webmasterworld.com...] (#6)

Never seen a dead horse so flogged before...

>> So in Brandy they reapplied weightings for ...
>> CIRCA: On or around a certain fixed date

Looking toward the future, I expect continuing change as we introduce new signals and algorithms into our ranking. Since we're no longer doing monthly dances, it's more likely that algorithms and changes will just roll out after they're ready and have been tested. [webmasterworld.com...] (#16)

Sorry about the telegram style - i'm no great fan of double posting. Also, the mentioned posts were written before "Update Brandy", please bear that in mind (they're not that far off the mark though, imho).

hommealone

5:19 pm on Feb 23, 2004 (gmt 0)

Hi all. Thanks for this thread; it's fascinating. But I'm a newbie and a little dense. Please tell me if I'm understanding correctly:

If I do those tilda searches for the separate words of my two word search term - alpha betas - then I find that Googles thinks the word "primary" is related to "alpha", and the word "carotenes" is related to "betas". These two related words have no strong direct relation to the 'alpha betas' that my site is about.

Are you thinking that if my page contained the words "primary" and "carotenes", that it would be helped on the SERPS? That seems a little far fetched to me; maybe I'm drawing the wrong conclusions from what you've said . . .

Hissingsid

6:10 pm on Feb 23, 2004 (gmt 0)

Hi hommealone,

If I could understand what your conclusions were I would comment on whether I think IMHO that you have maybe jumped to the wrong ones.

What you are looking for is close semantic matches that put your page into the correct topical context.

Best wishes

Sid

Chicago

3:52 am on Feb 24, 2004 (gmt 0)

Sid, you ever read up on Hilltop?

Hissingsid

8:23 am on Feb 24, 2004 (gmt 0)

Hi Chicago,

I briefly looked at a synopsis of Hilltop just after Florida. We were all in a bit of a panic and searching for answers. I was convinced by certain posts here by folks who know what they are talking about that what we were looking at wasn't Hilltop and it wasn't LocalRank either for that matter.

GoogleGuy has confirmed that the new algo includes semantics, Brett said something along the lines of "They didn't buy Applied Semantics Inc. for nothing" and I've seen how people here have implemented what they have learned about LSI to slowly but surely improve their rankings to the very top. Not a flash in the pan and not jsut because of tweaks to the algo at the updates.

I kind of see trying to reverse engineer for an algo that uses semantic indexing as a small but significant part as like going on a diet to lose weight. Long term you need to retrain your eating habits and we need to retrain our SEO habits to add more bredth and context to our pages and sites. Understanding LSI helps you to kick start this process.

Best wishes

Sid

Chicago

2:04 pm on Feb 24, 2004 (gmt 0)

Sid, I couldn't agree more about the use of semantics. But it appears as though the one thing that you are missing in your profesional analysis is hilltop. Please take a look at this well written and recent analysis: Hilltop ~ expert pages and authorities [seorank.com]

Given the patterns that we are seeing, it seems clear that both hilltop and semantics are a play. I do not mean to disrupt the LSI thread, but reading thru your comments suggests that Hilltop may be the only missing link. This is the orginal paper [cs.toronto.edu]

Thanks for your excellent analysis on this topic and thread.

Chndru

1:14 am on Feb 26, 2004 (gmt 0)

Thanks for the discussion, Hissingsid. And a nice link, Chicago.
It's been worth a read.

cbpayne

2:55 am on Feb 26, 2004 (gmt 0)

Please take a look at this well written and recent analysis

Its well written and makes sense, but in "internet time", it not exactly recent. It was first written late Dec - that is two algo updates ago and lots of thinking and analysis has happened since then. However, it is still one of the more sensible and lucid explanations of what is happening.

adamas

2:55 pm on Feb 26, 2004 (gmt 0)

Net_Wizard, I would guess that ink is related to inktomi because many people shorten the name to ink when discussing inktomi. If you search google for mentions of +ink and +inktomi there are 3700 matches in webmasterworld.com alone. Searching the whole google index for +hot and +hotdog returns just 8 results (and one of those is in this thread!) so it is unlikely to connect the two semantically.

Google has always had a bias towards the technical meaning of words simply because the web has a bias towards the technical meaning of words.

Hissingsid

6:28 pm on Feb 26, 2004 (gmt 0)

I couldn't agree more about the use of semantics. But it appears as though the one thing that you are missing in your profesional analysis is hilltop. Please take a look at this well written and recent analysis:Hilltop ~ expert pages and authorities

Hi Chicago,

I've been away from the office for a couple of days so appologies for not getting back to you earlier.

The page you point to looks like a pretty good summary and does detail some effects which we have noted previously and which are confirmed by what happened at Brandy. I need to go back and read the original Hilltop papers to see how the sample is selected from the corpus and what discards are used.

There was some detailed analysis I sent to Google when GG asked for feedback which strongly led me to believe that semantic indexing was the culprit. If I searched for widgets financial I was bak to #1 but if I searched for widget financial I was dropped.

Also if I searched for big widget financial I was nowhere but big financial made me #1 again.

Now if those anomolies can be produced by Hilltop, I'm very interested.

I'm fairly sure that there is some ellement of the new algo which is leading to much stricter de duping. Both a semantic indexing algo and Hilltop would produce this and maybe a hundred other algorithms.

We know that Google had problems with the ODP data. I got an emailing saying that they were aware of the problem and were working on it. If the adding back of ODP data that had been unused since Florida was what Brandy was about then this would add weight to the Hilltop argument.

Very confusing.

Best wishes

Sid

Just Guessing

10:24 am on Feb 27, 2004 (gmt 0)

Hi Sid,

I also do not want to push this thread off the topic of Semantics, but I do also believe that "on topic links" are an important part of the current algo.

The original Hilltop paper has some very specific rules about how a link on an expert page is determined to qualify for a search term - I forget the terminology, but it is all to do with whethehethe link on the expert page is qualified by title text, headings tags or anchor text containing the search term. I do not believe Google has implemented those precise rules, but I do believe more weight is attached to inbound links from a page determined to be "on topic".

I would also agree that there are some algo changes in duplicate content filtering - it seems to me that in some cases for pages with similar but not identical content, the "less authoritative" page is dropped or pushed way down. This even seems to affect DMOZ & Google Directory pages - you don't often see them ranked together like they were before.

Ultimately, an algo combining semantics, topic sensitive page rank/expert system/theming, and duplicate content filtering should be a world beater. I guess Google are finding it harder than they thought and the press reaction hasn't been quite as planned - hence the delayed IPO.

This 150 message thread spans 5 pages: 150