|What is LSI?|
Hello everybody. There have been rumors that Google is using LSI for its searches. Is that right, and if it is, what should an SEO focus on to get high rankings?
LSI : Latent Semantic Indexing
To cut it short, Google will 'understand' content by finding relationships between words and phrases. Search online and you'll find detailed information on this and how it's done, including many PDFs on the subject from some good universities.
Here's a WebmasterWorld thread which includes a link to the clearest presentation of the subject I've read, which is the original NITLE article, Patterns in Unstructured Data, by Clara Yu, John Cuadrado, Maciej Ceglowski, and J. Scott Payne....
Latent Semantic Indexing
If anyone is thinking about or considering "LSI Optimization," there's no such thing.
It's important to understand the concepts behind it, as well as some of the underlying IR concepts, like IDF and KW co-occurrence; but if anyone tries to sell you LSI SEO services, ask them how big the dataset is that they're using.
Google's data for any kind of semantic analysis consists of words and (more importantly) phrases derived from more than 8 billion web pages, and then some - drawn from documents other than web pages as well.
[edited by: Marcia at 4:51 am (utc) on Dec. 13, 2007]
I don't think it's a coincidence that sites running tag clouds instead of parent/child category structures are now doing very well.
But it ain't always the case either.
Tag clouds will be completely ineffective and will only get a site listed in virtually invisible positions if the site isn't relevant to its tags.
There are two ideas at play, one is bringing most of the pages to the highest possible levels ( least number of clicks from home ), the other is to implement a highly relevant navigation. Unoptimized clouds ( ...which is more like a usability than an SEO issue ) will do nothing for you.
As with Wikipedia's cross-referencing method, blog entries listed by category AND their titles, or even a well-distributed parent/child or breadcrumb navigation, ranking depends on PageRank and relevancy. The only difference is that clouds give you a pretty good alibi for putting 60+ competitive words/phrases onto every page for 'better usability' - thus leaving your site without a manual penalty for excessive navigation. As a user I don't like tag clouds... no, make that: I feel they're completely useless for anything other than entertainment purposes.
Hmm, Wikipedia is not a good example, as each word in the text which is also a category links to that category... their articles are one big tag cloud, with only the joining words not being tags.
Oh, I hate tags too; usually they are nonsense. But my job is SEO, and if you look at current trends, well-done tag clouds are performing far better in Google than top-down category sites, at least in the industries I monitor.
What I meant to say is that the reason why 'tag clouds' may perform better is still based on generic guidelines for designing a proper navigation ( relevancy, flow of PR, co-occurrence, filters, whatever ).
As for Wikipedia, I wouldn't call titles and alternative titles tags. The point of tags is generalizing stuff, being able to re-use tags. The unique combinations of common tags may or may not produce unique relevancy for the target page. Tags find common aspects and use anchor text fit for one or at most two word queries. Categories of sorts, which then will create a cross-fire of relevant nav links. It's like making Google guess what the page could be about without adding every such page to the nav.
Is it blue? yes. Is it round? yes. Is it a widget? yes.
Oh. then it's a blue round widget, isn't it?
But there are other blue, other round and other widget pages as well. Some are even tagged '2008', some are tagged both blue and red. Some aren't tagged at all.
But Wikipedia uses exact phrases, on-spot relevance, and addresses all articles by their own titles / alt. titles, subtitles, and not generic categories. So unless every keyword or phrase that is used for cross-referencing is considered a tag, Wikipedia doesn't use tag clouds.
It uses extensive cross-linking, which makes its navigation virtually horizontal. But in fact the base IS top down... but that little stream of PR is overrun by a river of the cross-references.
I do not remember the last time that I was so interested in a topic that I re-read it three times…
|Tag clouds will be completely ineffective and will only get a site listed in virtually invisible positions if the site isn't relevant to its tags. |
Are you suggesting that a very general site, such as an article site, would not benefit from tags? (Or tag clouds, which in my eyes are the same as tags but in different font sizes; please correct me if I am wrong here.)
[edited by: kamikaze_Optimizer at 9:15 am (utc) on Dec. 15, 2007]
Tags are a separate issue.
Here are the basics of LSI. If you have a really, really big collection of documents (and Google does), you can create a pretty good vector representation of all words in existence. Vectors can be added together, so you can do whole documents too.
This means you have an estimate of...
* how similar two words are to each other;
* how similar any document is to any word;
* how similar two documents are to each other.
Similarity scores are between 0 and 1 (the cosine of the vectors). 1 means "identical". 0 means "not similar at all".
The end result is, you have a better system for matching queries to documents. The search engine can just retrieve the vectors and run a simple calculation. You don't have to do "keyword matches". You don't have to use stemming. The LSI takes care of all that. It takes into account every word in the document, and whether they are of a similar theme to the query.
Simple. Cheap. And (moderately) effective.
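To make the vector idea above concrete, here's a minimal sketch (not Google's actual system; the 3-dimensional "concept" vectors are invented for illustration): words and documents live in the same space, a document vector is just the sum of its word vectors, and everything is compared by cosine.

```python
import math

def cosine(a, b):
    """Cosine similarity: 1.0 = same direction, 0.0 = not similar at all."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional "concept" vectors for three words.
car   = [0.9, 0.1, 0.0]
auto  = [0.8, 0.2, 0.0]
fruit = [0.0, 0.1, 0.9]

# A document vector is the sum of its word vectors.
doc = [c + a for c, a in zip(car, auto)]

print(cosine(car, auto))   # high: related words
print(cosine(car, fruit))  # near zero: unrelated words
print(cosine(doc, car))    # a document compares to a word the same way
```

The same one-line cosine handles word-to-word, word-to-document, and document-to-document similarity, which is the "simple calculation" the retrieval step runs.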
|....basics of LSI....Simple. Cheap. And (moderately) effective. |
As anyone in IR research who has messed with LSI will tell you it's the exact opposite.
Simple? 99.9% of articles on the web on the subject of LSI are nonsense and demonstrate the difficulty most people have in understanding the concept. If it were simple this would not be the case.
Cheap? Not cheap but expensive, computationally it's about as expensive as it gets.
Effective? The jury is still out on the results for medium size corpora and nobody has tried anything like even 1% of the size of the web yet.
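For readers wondering where the expense comes from: the core of classic LSI is a singular value decomposition (SVD) of the term-document matrix, which costs roughly O(m * n * min(m, n)) for m terms and n documents - ruinous at web scale. A toy sketch of the pipeline, with a made-up 4x4 count matrix:

```python
import numpy as np

# Rows = terms, columns = documents (raw counts, hypothetical).
A = np.array([
    [2, 0, 1, 0],   # "widget"
    [1, 1, 0, 0],   # "blue"
    [0, 2, 0, 1],   # "cheese"
    [0, 0, 1, 2],   # "sky"
], dtype=float)

# Full SVD: for web-scale matrices this is the prohibitively costly step.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the top k singular values: the "latent" semantic dimensions.
k = 2
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T   # one k-dim vector per document

print(doc_vectors.shape)  # (4, 2): 4 documents, 2 latent dimensions
```

Once the decomposition exists, comparing the small k-dimensional vectors is cheap; it's building (and rebuilding, as the corpus changes) the space that hurts.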
|....basics of LSI....Simple. Cheap. And (moderately) effective. |
As anyone in IR research who has messed with LSI will tell you it's the exact opposite.
Okay, maybe I didn't express myself clearly.
"Simple": I didn't mean 'simple for the general public to understand.' I meant simple to implement, because the technology already exists.
"Cheap": You're right, it's not cheap to compute the space. But once you've got it, the vector algebra is computationally cheap.
"Moderately effective": It's an open question just how valuable it is. Its information retrieval value is dubious, but it cuts ice in natural language understanding. As for corpus size, there seems to be a law of diminishing returns.
[edited by: callivert at 11:34 am (utc) on Dec. 15, 2007]
Supposing they had a workable method that can better evaluate textual content, where and how might it be factored in?
My guesses would be:
1. Link anchor text value dialed down.
2. Topical link value dialed up.
How about Keyword Co-Occurrence and Phrase-Based Indexing [webmasterworld.com]?
Six patent applications worth, with repeated emphasis on the ineffectiveness of using just "terms" (words) as previous systems did, but rather looking for related phrases.
Incidentally, LSI using SVD is a patented process, with the patent applied for in 1988 and granted in 1989 - not to a search engine, either.
[edited by: Marcia at 11:59 am (utc) on Dec. 15, 2007]
|"Cheap": You're right, it's not cheap to compute the space. But once you've got it, the vector algebra is computationally cheap. |
I assume they take a subsample of desired or "clean" documents, e.g. Wikipedia, and then throw that algo at the rest of the web.
That way it is really computationally cheap.
There was a time, around January to March 2007, when suddenly all WP clones were up for one three-month algo cycle. That is exactly what would happen if you did the above. Then they might have realized the side effect and cleaned it out.
Well, I hope Google isn't so blatantly ignorant as to assume they could understand the web with an AI algorithm alone. A human decides what is relevant, and then you model on that. That way real intelligence comes in, not the intelligence equivalent of a newborn cockroach.
"The space" is probably quite small. But maybe they have completely lost the plot.. and try to "understand" the whole web. The original backlink algorithm was using human intelligence, so why would they want to compute the whole space suddenly?
Look at a search where a famous car maker has the same name as an animal. You specify that you're looking for the animal, and then you get bolded matches in the title tag and the URL. Looks pretty basic to me.
I think what happens is some basic ranking, and then a check that the document actually has meaningful sentences, based on a subsample like WP. The "White House - George Bush" approach.
Pages that have only a picture of the animal with a short tag of the species name are ranked below the page explaining the biology of the animal.
My site ranks #1 this way, with all the other picture pages under "see more results from this site."
Even right now, 'Relevant' doesn't necessarily mean 'syntactically relevant' as in, documents don't have to use the exact phrase anymore, nor get linked with the exact phrase to have the potential ( or latent potential ) to come up for a search.
Google does have thematic categories.
These aren't really defined; rather, they are a virtual concept that depends on what the initial word (set) you analyze is. For example, if it's 'mycity information', a lot of phrases and words will be considered relevant, from 'mycity restaurants' to 'mycity population' to 'mycity news'... while 'mycity accommodation' will land close to 'mycity hotels' and 'mycity tourism' but not 'mycity taxes'.
Even if it's not weighed in for all URLs in all sectors ( themes / categories ) at this moment, even if it's not even calculated or applied, they do have a basic concept and the technology to identify and sort documents based on thematic relevancy.
Once Google has decided that a source is an authority on news, it will behave differently from crawling through indexing to ranking the documents of that website. Whatever appears on the domain which is relevant to 'news' ( either a brand new topic, or a topic discussing something of current interest ) the new information will be displayed as a relevant result for the topic way ahead of any other sources. This is more like the effect or product of their efforts, and not the core of the system, but... either way it goes to prove that they can identify items of current interests ( and tell them apart of generic searches ), they can identify items that are relevant to news ( and tell them apart from just being popular ), and they can identify websites that can be considered as news sources ( and tell them apart from article drop sites, blogs, spam, whatever ).
A more complex - and down to Earth - example is contextual advertising.
Google has - okay, so it HAS TO have - the ability to tell the difference between syntactic and semantic relevance, so that the 'blue cheese' topic wouldn't be considered relevant to 'blue sky' or 'blue ray'. AdWords already applies a basic category structure to be able to provide a candidate list of publisher websites within the content network for advertisers. At least the AdWords API has this feature. Some categories are broad, some sub-categories are very narrow, and they seem to be calculated based on the popularity of certain services / advertising trends.
And what's very current, all the -950 results you can get.
There are of course other possible reasons, but a few of the most common are: lack of support for a theme ( ie. lack of related phrases 'expected' to be present, not enough 'vectors' to calculate with ), distorted signals ( word order, language doesn't make sense, overuse, especially in the navigation ), low signals ( the document is relevant but the source is not, no IBLs supporting the phrases or their variations, ie. there are not enough vectors to calculate with ), and related phrase overuse ( ie. vectors end up off the paper... in the red zone, call it what you will. Okay, let's call it repetition and/or spam so no one starts throwing out valuable content. )
A keyword cloud or tag cloud or any such highly generic navigation will produce a wide range of thematic relevancy within the site. If it's structured well, of course. Google is getting better at serving results that were made for people and not its then infant AI, meaning if you build sites to send out signals humans can't make sense of, you'll end up -950 or simply, out of the index.
All in all, if the navigation ( and all other relevancy signals ) make sense to a person, it'll make sense to the 'agents' of programmer/linguist/search quality people all the same.
It's usually not the lack of relevancy within the content that will send sites to the middle of nowhere on the SERPs, but the overall confusion over the theme of the website, and/or 'robotic' signals - stuff made by or intended for audiences other than humans. Inbound links, navigation, either too many vectors, too few, or too confusing to be intended for people. There's an awful lot of sources to look at, and a relevant document originating from an irrelevant source will not be listed (at the top).
So if you want to have a wide range of ( usually ) overly generic keywords rank, whether they're in a keyword cloud or not, you have to provide the proper signals towards google that you, as a source, are relevant to these topics. And don't do SEO that'll make apple pie taste like metal.
And with all the wonders of analyzing content for semantic indexing, most of the checking goes to anchor text, so most of it is still decided based on links. I can't begin to stress how relieved I am that most people have no clue about what makes a site pass the thresholds. You could have millions of links and your authority still not proven, while you could have but a dozen references and clear the requirements, so that your content, already considered relevant by the rest of the algo, can show up at high positions.
|When you look at a search where a famous car maker has the same name as an animal. You define that you look for the animal and then you get bold stuff in the title tag and the url. Looks pretty basic to me. |
That's polysemy: the same word with different meanings. That's where word sense disambiguation comes in, and the car and the jungle animal will have completely different co-occurrence profiles.
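The co-occurrence idea can be sketched very roughly like this (all the context word lists are invented for illustration, and a real system would use weighted statistics over a large corpus, not hand-picked sets): the two senses of an ambiguous word show up next to very different neighbors, and the other words in the query pick the sense.

```python
# Hypothetical co-occurrence profiles for the two senses of "jaguar".
CAR_CONTEXT    = {"engine", "dealer", "sedan", "horsepower", "price"}
ANIMAL_CONTEXT = {"jungle", "prey", "species", "habitat", "spotted"}

def disambiguate(query_words):
    """Pick the sense whose co-occurrence profile overlaps the query more."""
    car_score = len(query_words & CAR_CONTEXT)
    animal_score = len(query_words & ANIMAL_CONTEXT)
    return "car" if car_score > animal_score else "animal"

print(disambiguate({"jaguar", "engine", "price"}))    # car sense
print(disambiguate({"jaguar", "jungle", "habitat"}))  # animal sense
```

Crude as it is, this is the shape of the trick: the ambiguous word itself carries no signal, and its company decides.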
|That's where word sense disambiguation comes in.... |
Marcia - Thanks for jogging my memory here. Disambiguation, as I remember, is where data retrieval experts I've read feel LSI is the most useful. Not to rank pages initially from a large data set, but to pull out results most appropriate to a query from a smaller set of pages.
Robert, there's a bit about word sense disambiguation in the first post here in this thread, and the Powerpoint presentation cited is definitely worth a watch or two.
Incidentally, here's a transcript of a presentation by Dr. Susan Dumais, one of the pioneers of LSI, who was with Bellcore and has now been with Microsoft for a few years.
Notice the part where she makes mention of the size of the data sets, the time element and processing power, and specifically makes mention of some 24 hour timeframe.
I think the basic LSI ideas are closely related to, and have been elaborated into, the phrase-based indexing idea, which tedster brought into discussion on various occasions.
So it is not only word co-occurrence, as in the classic LSI papers, but co-occurrence of clusters of words = phrases. But I do not know whether this has been investigated in recent LSI research as well, perhaps under different terminology.