| This 150 message thread spans 5 pages |
|Latent Semantic Indexing |
| 12:16 am on Jan 11, 2004 (gmt 0)|
Please do not kill me for reminding you of the semantic approach, but I definitely believe we must consider the LATENT SEMANTIC INDEXING [webmasterwoman.com]
theory; this puts together things like semantics and stemming, and explains why our singular keywords are not affected
Note: the original url is no longer online, but I've
edited the link to point it to a reprint of the original
[edited by: tedster at 7:11 am (utc) on May 10, 2007]
| 10:48 am on Feb 10, 2004 (gmt 0)|
I believe that you have a little gem here that many have overlooked. Well done!
Others should, in fact MUST, read the paper.
Then consider what if Google has implemented this as a three stage process.
Stage 1. Select a sample using the standard "old" Google algorithm and store the ranking of those pages.
Stage 2. Apply LSI excluding the actual term used in the search. This ranks #1 the site which has the strongest match to the theme around the search.
Stage 3. Combine the two rankings and present the results in the user's browser.
In fact as they are only doing this on a list of terms the LSI processing could be done offline and the vector stored for combination with the standard rank.
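To make Stage 3 concrete, here is a minimal sketch of blending two stored rankings. It is only an illustration of the hypothesis above, not anything Google is known to do: the function name `combine_rankings`, the `theme_weight` knob, and the example page names are all made up for this sketch.

```python
def combine_rankings(standard_rank, theme_rank, theme_weight=0.5):
    """Blend two rank positions per page (1 = best) into one ordering.

    theme_weight is a hypothetical knob; nothing in the thread says how
    the two rankings would actually be weighted.
    """
    combined = {}
    for page, std_pos in standard_rank.items():
        # A page with no stored theme rank is placed last in the theme ordering.
        theme_pos = theme_rank.get(page, len(standard_rank))
        combined[page] = (1 - theme_weight) * std_pos + theme_weight * theme_pos
    # Lower blended position is better; ties fall back to the standard rank.
    return sorted(combined, key=lambda p: (combined[p], standard_rank[p]))


# Stage 1: positions from the "old" algorithm (hypothetical pages).
standard = {"stuffed.example": 1, "authority.example": 2, "directory.example": 3}
# Stage 2: positions from a theme score computed without the search term.
theme = {"authority.example": 1, "directory.example": 2, "stuffed.example": 3}

# Stage 3: combine the two and present the result.
results = combine_rankings(standard, theme, theme_weight=0.6)
unchanged = combine_rankings(standard, theme, theme_weight=0.0)
```

With the theme ranking weighted at 0.6, the page that was #1 on the old algorithm drops behind the page with the strongest theme match; with a weight of 0 the old order is returned untouched.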
This explains how a so called OOP penalty could be applied.
LSI in this implementation could have a site element. Larger sites and directories would have a better chance of having themes close to the term than smaller more concentrated sites.
It explains why some pages that don't even have the term on them rank #1
If you search for term +www you throw out the LSI vector in the algorithm since this is only stored for the terms that Google has identified for this form of analysis.
~widgets -widgets is very useful in showing you terms that are a very close semantic match to your term for incorporation in your site.
This is the silver bullet.
Thank you Marin.
| 2:13 pm on Feb 10, 2004 (gmt 0)|
This is the prime reason Google bought Applied Semantics in 2003.
I can confirm that this technology works well. I can also confirm that it has problems if extended too far--which I think Google has done.
Back in 1998 I was publisher of several trade journals. One was for the equine (that's horses) industry. This new employee with the funny title of "webmaster" (I thought she should wear black leather with that title) took our text and put it on this thing called a "web site."
In a meeting she demonstrated this software called a "search engine." You could find any article easily. Just type in what you're looking for, and hit return, and ta da! there it was.
OK! Our most popular story--often reprinted--was a basic feature on colic. So, I typed in colic. Couldn't find it.
Turns out, the article never used the word colic, other than the headline, which was not indexed. Instead, the author used the more informal words, such as "tying up."
This happens more than you think, especially on powerful authority sites which use their own short hand. (For example, to find the most useful posts on this site searching on the subject of "search engine optimization" you'll do better searching on SEO.)
This excellent paper--it is, indeed, a must read--talks about how this is addressed in stemming. But how we talk evolves faster than you can imagine. (Again, a good example is SEO.)
Google knows all of this, let me assure you. And they will get it worked out. But, I'm convinced there is a market for services that provide custom access for readers with different interests and different communities which Google will not be able to address for years.
I don't know: Maybe you could get a Google franchise to handle a certain area? Wait! Forget that, this sounds like Sprinks. Wait again, Google bought Sprinks.
| 2:17 pm on Feb 10, 2004 (gmt 0)|
I think you're on to something guys, let's see what other webmasters think:
Note: the original url is no longer online, but I've
edited the link to point it to a reprint of the original
[edited by: tedster at 4:28 am (utc) on May 10, 2007]
| 4:53 pm on Feb 10, 2004 (gmt 0)|
I first looked into this and thought it might be part of the cause of Austin's devastation about a week ago. In doing so, I did find that many of the sites ranking high in my keywords had quite high numbers of terms Google related to our keywords. (notice, I say many, not all)
I don't think this is the only change, but it appears to me to be a major part of it.
I am quite convinced that this is a major part of the new algorithm. It also is evident to me that this is not calculated on the fly. That creates a sort of glass ceiling for certain sites. IF you do change things, you won't know if you're successful until after they calculate this thing again.
| 7:56 pm on Feb 10, 2004 (gmt 0)|
|That creates a sort of glass ceiling for certain sites. IF you do change things, you won't know if you're successful until after they calculate this thing again. |
This is true. For many reasons, obvious and not so, Google sees this lack of indexing as their job to fix, but won't let anyone give them a hand in it.
As far as selling ads goes, Kanoodle, on the other hand, has said that, hey, let's just ask the web sites (and the advertisers) to ID themselves as to where they are supposed to go. Quigo is going to have a means of getting a bit of a helping hand from the web site, but with a lot of AdSense-type software filtering (and now you know how it works!) thrown in as well. (I think both Kanoodle and Quigo are heading to the same space, just from different starting points.)
If Quigo and Kanoodle look a little silly, check this out: As far as indexing the web, Google is getting some help (whether they want it or not):
The World Wide Web Consortium (W3C) Tuesday (Feb 10, 04) passed two key standards for helping computers get more information out of the applications they are processing and match content more appropriately for end-users.
The standards body unveiled the Resource Description Framework (RDF) and the OWL Web Ontology Language (OWL) as part of its plan for the "Semantic Web," which W3C Director and Web pioneer Tim Berners-Lee described as a "great big database" at an event last year.
The idea behind the Semantic Web is to give data more meaning through the use of metadata, which describes how and when and by whom a particular set of data was collected, and how the data is formatted....
Bottom line? Search is going to get better--but this isn't for children anymore.
| 10:42 pm on Feb 10, 2004 (gmt 0)|
From the paper on LSI:
|Make a complete list of all the words that appear anywhere in the collection |
1. Discard articles, prepositions, and conjunctions
2. Discard common verbs (know, see, do, be)
4. Discard common adjectives (big, late, high)
5. Discard frilly words (therefore, thus, however, albeit, etc.)
6. Discard any words that appear in every document
7. Discard any words that appear in only one document
This process condenses our documents into sets of content words that we can then use to index our collection.
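For anyone who wants to see what that list does to a collection, here is a rough sketch. The stop list is a tiny stand-in for the "discard" steps on word classes (a real one would be far longer), and the function name `content_words` and the sample documents are invented for illustration; only the two document-frequency rules come straight from the quoted list.

```python
import re

# Tiny illustrative stop list standing in for the word-class steps of the
# quoted list; a real list would be far longer.
STOP_WORDS = {
    "a", "an", "the", "of", "in", "on", "and", "or", "but",  # articles, prepositions, conjunctions
    "know", "see", "do", "be", "is", "are",                   # common verbs
    "big", "late", "high",                                    # common adjectives
    "therefore", "thus", "however", "albeit",                 # "frilly" words
}


def content_words(documents):
    """Return one set of content words per document."""
    token_sets = [set(re.findall(r"[a-z]+", doc.lower())) for doc in documents]
    # Document frequency: in how many documents each word appears.
    doc_freq = {}
    for tokens in token_sets:
        for word in tokens:
            doc_freq[word] = doc_freq.get(word, 0) + 1
    n_docs = len(documents)
    keep = {
        word for word, df in doc_freq.items()
        if word not in STOP_WORDS
        and df < n_docs  # drop words that appear in every document
        and df > 1       # drop words that appear in only one document
    }
    return [tokens & keep for tokens in token_sets]


docs = [
    "The green widgets are big and the gadgets are late",
    "Green widgets and blue gadgets",
    "Green widgets in the blue box",
]
index = content_words(docs)
```

In this toy collection "green" and "widgets" appear in all three documents, so they are thrown out along with the one-off words, and only "gadgets" and "blue" survive as index terms.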
Assuming that semantics and stemming are an integral part of the algo, does this mean that the way Google looks at keyword density is different than how most of the kw density tools calculate density?
| 9:08 am on Feb 11, 2004 (gmt 0)|
|Assuming that semantics and stemming are an integral part of the algo, does this mean that the way Google looks at keyword density is different than how most of the kw density tools calculate density? |
I'm sure that it is much more complex than this but in simple terms I see this as probably a three step process.
1. Search for sample to be analysed using standard Google algo. Store some form of rank index.
2. Analyse sample from 1. using LSI but in addition to the words thrown out in the standard LSI also throw out the search term.
3. Combine the results from 1. and 2. and present the SERPs to the browser.
In this way you still get some benefit from being high in standard rankings but pages/sites that are just stuffed with the search term without a good measure of closely semantically related terms (in the view of the Google/Applied Semantics Ontology) get dropped back and those that have very close related words rise up. Also words in some places are better than those in other places on your page. Links to pages either on your site or outside that are on closely semantically related topics seem to be given more weight than words in body text for example.
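A crude way to picture step 2 is shown below. This is not LSI (there is no matrix decomposition here), just a plain word-overlap stand-in under the assumption that the search term is thrown out before scoring. The function name `theme_score`, the related-terms set, and the two sample pages are all hypothetical.

```python
import re


def theme_score(page_text, query, related_terms):
    """Share of a page's non-query words that are in related_terms (0.0 to 1.0)."""
    query_words = set(query.lower().split())
    words = re.findall(r"[a-z]+", page_text.lower())
    # Throw out the search term itself, as step 2 above supposes.
    remaining = [w for w in words if w not in query_words]
    if not remaining:
        return 0.0
    hits = sum(1 for w in remaining if w in related_terms)
    return hits / len(remaining)


# Hypothetical related terms; in practice these would be whatever the
# engine's ontology considers close to the query, not this made-up list.
related = {"gadgets", "gizmos", "sprockets"}

stuffed = "widgets widgets widgets buy widgets cheap widgets"
themed = "widgets gadgets gizmos and sprockets compared"

stuffed_score = theme_score(stuffed, "widgets", related)
themed_score = theme_score(themed, "widgets", related)
```

The stuffed page scores zero once its search term is removed, while the page that talks around the topic keeps most of its score, which is the drop-back and rise-up effect described above.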
What you need to do is to look at your pages excluding the search term. Now, is it still very clear what the page is about without those words? Think of closely related terms that you could use to make the page about the broad topic and up you will come.
If you need help finding what Google thinks are the most closely related terms to your own do this search ~widgets -widgets and look at the words that are highlighted in the results. Repeat this with all of the words in your search terms and build those into your page and have pages on those on your site.
I'm coming to the conclusion that it is not necessary to remove or lower the density of the terms in question. It may be sufficient to simply add in related terms and pages on related terms.
What are related terms? First of all they may not be what you think they are. Do the Google Synonym search detailed above. Look at the pages of your competitors that are ranking highly excluding the terms involved and then add in some of those terms into your pages paying particular attention to having them in anchor text and having pages on those topics at the end of the link.
That's what I'm doing anyway!
| 1:38 pm on Feb 11, 2004 (gmt 0)|
You have got to have the keywords on your site, obviously. The point is many different web sites can have the same keywords. How a search engine filters those kw to determine which web site is the authority is what is getting refined.
The examples I've used here are so basic as to be almost meaningless, but the point is important: Looking at how the high ranking sites are speaking about the search term might give you a clue to how that high rank was achieved. Yes, linking is still important, certainly.
My thinking is that the filtering is going to be more sophisticated on the more important or popular searches. Take, for example, breast cancer or any other health topic. The rankings here are very sophisticated.
But, again, one has to have the words on the page. If your web site is about Anytown, Alabama, it should have Anytown, AL, and Anytown, Ala. all in the text so that a search engine can find it. Now, the question is, in this case are search engines such as Google building filters so as to give higher ranks to sites that assume authority with city names in combination with phrases such as "government" or even "directory" or "news"?
| 2:40 pm on Feb 11, 2004 (gmt 0)|
You really need to go and read this [javelina.cet.middlebury.edu ]
The point is that it explains the generalities of how CIRCA works. CIRCA is a form of Latent Semantic Indexing which belongs to Google.
What I'm suggesting is that yes you do have to put the terms in your text. Those as part of the old Google ranking will get you into the LSI game. Then it comes down to close semantic matches.
Organic old style Google ranking plus CIRCA/LSI vector gives you the current rank.
Google doesn't have a sliding scale for different classes of search; it doesn't need to. If a term is in the ONTOLOGY then CIRCA can find a word match map. If we assume that it then disregards the actual term then the sites that will rank well are those with authority terms.
There is a lot more to it than this but in principle that is the basic knowledge. What you do with it is up to you.
|More Traffic Please|
| 3:34 pm on Feb 11, 2004 (gmt 0)|
|What you need to do is to look at your pages excluding the search term. Now, is it still very clear what the page is about without those words? Think of closely related terms that you could use to make the page about the broad topic and up you will come. |
The thing I find interesting about this theory is how it might explain the way queries like "AnyCity real estate" and "AnyCity hotels" took such a big hit. Finding closely related terms for real estate is not very hard at all (homes, for sale, Realtor, realty, land, property, etc.). But, what if the city name is very unique, like Verton? If the page is now erased of that term, that creates a problem. To the best of my knowledge, there are no similar terms that would indicate the page is about real estate in Verton (a fictitious city). In the eyes of Google, the page would just be another page about real estate.
So what happens when a search query contains words that there are no closely related terms for?
| 3:46 pm on Feb 11, 2004 (gmt 0)|
|So what happens when a search query contains words that there are no closely related terms for? |
very good point. many of my sites are technical in nature, and the keywords aren't real words - they were made up by someone when the technology was created. an example would be if i made a new technology to measure the amount of times a dog's tail wags in response to different situations, i might coin the term, wag-o-meter. so, if google decides to dump that term, it would be very difficult to find related terms.
| 4:09 pm on Feb 11, 2004 (gmt 0)|
|To the best of my knowledge, there are no similar terms that would indicate the page is about real estate in Verton (a fictitious city). In the eyes of Google, the page would just be another page about real estate. |
What about the state that Verton exists in, or perhaps the region? These are much more common, and therefore less likely to get hit. There are tons of pages that contain "Ohio" or "Middle East" and these are far less likely to get bumped. Yes, it means that all the optimization for "Verton" is shot, but I think it should be recoverable.
|More Traffic Please|
| 4:34 pm on Feb 11, 2004 (gmt 0)|
I agree that a search for Ohio real estate may give better results. But, if I only want to know about homes for sale in Verton, I'm not interested in inner pages from the Ohio Department of Real Estate that only mention the word Verton once on an 800 word page. These are the type of results I started seeing in real estate searches after Florida. I'm just wondering if it could be because city names are often unique and similar terms in other documents may be difficult to come up with. As a result, the search engine is forced to give you serps of a much broader and less relevant nature.
[edited by: More_Traffic_Please at 4:44 pm (utc) on Feb. 11, 2004]
| 4:40 pm on Feb 11, 2004 (gmt 0)|
|6. Discard any words that appear in every document |
Ummm, does this really mean that a site about 'green widgets' - which uses the phrase 'green widgets' on every page - would have the words 'green widgets' tossed out for *all pages*?
Gee, if something like this were really in place now, and I'm not saying it is...it would explain more than a few things.
| 4:54 pm on Feb 11, 2004 (gmt 0)|
I took heart in some of your suggestions several weeks ago and am now bouncing from page 1 to page 2. Definitely better than being lost post Austin.
I think you have some valuable info here. Old style Google ranking plus CIRCA/LSI is probably the last step as to why I can't get over the #1 hump. The site was #1 for a couple of years for very competitive keywords. This combining of 2 algos makes sense when viewing what happened to me.
I have posted in the past of the steps I took over the weeks to get back. I used the word synonyms as how I approached this. In the kids dictionary, for those like me who need things spelled out simplistically, the definition is as follows...
Synonyms are words that have the same, or almost the same meaning.
The words stones and rocks are synonyms.
I took this approach and SLOWLY adjusted keyphrases in the titles and body of some pages. After every change I waited several days to see if the results were positive. Not only were they positive, it seemed to have attracted googlebot to crawl more often. Don't know if this is related, but this is in fact what happened.
Stuffing KWs is pretty much dead. I would pick several good phrases and try to write naturally to portray what you are trying to say by alternating the phrases with synonyms.
Just my 2 cents.
| 4:58 pm on Feb 11, 2004 (gmt 0)|
|I'm just wondering if it could be because city names are often unique and similar terms in other documents may be difficult to come up with. As a result, the search engine is forced to give you serps of a much broader and less relevant nature. |
That would certainly explain a lot of the problems with city real estate and travel related SERPs.
I just did a search for ~city -city on 10 well known cities.
And the same for states.
None of them had any matches.
There are no pages that have a synonym of ****** on them without the word ****** on the pages.
If you do a search for the same state like this ~state all of the pages have the actual state name on them not a synonym. It seems that there are no synonyms for place names. So New York sites can't even use "The Big Apple".
In my own case according to Google there are synonyms for my problem word but these are not in normal use in the UK they are US centric synonyms. I can at least work around this problem by introducing other limeys to the glory of American English ;)
I hope someone from Google is reading this because this is the new algo's Achilles heel and they need to get the spear out quick.
| 4:59 pm on Feb 11, 2004 (gmt 0)|
|really mean that a site about 'green widgets' - which uses the phrase 'green widgets' on every page - would have the words 'green widgets' tossed out for *all pages*? |
isn't the point that this is probably a second phase of sorting?...so all pages have already been ranked for green widgets, now they are being re-ranked on related factors...so they aren't in effect dropping the term..they are applying the theme
| 5:04 pm on Feb 11, 2004 (gmt 0)|
|Ummm, does this really mean that a site about 'green widgets' - which uses the phrase 'green widgets' on every page - would have the words 'green widgets' tossed out for *all pages*? |
Don't forget that the published paper describes one take on LSI. I'm pretty sure that CIRCA is a variant on this.
My hypothesis is that the term searched for is excluded or its value markedly reduced. If they are creating the sample for analysis using the old algo then it is a safe assumption that most or all of those pages will have that term on them. What separates out the authority from the lesser sites is words associated with the term.
| 5:31 pm on Feb 11, 2004 (gmt 0)|
|isn't the point that this is probably a second phase of sorting?...so all pages have already been ranked for green widgets, now they are being re-ranked on related factors...so they aren't in effect dropping the term..they are applying the theme |
Ummm, I think so. My brain hurts a bit too much to be certain. ;-)
But to Sid's points, on the now much discussed points of stemming, CIRCA, etc.....
Unless your homepage has:
a) a very high PR
b) a substantial number of kw's tightly related to the words you care about, or,
You have little hope once the re-rank is done, of showing up for your really important search phrase: only for the related phrases, which may or may not be what your site is really about. Or at least, that is certainly true if, as pointed out above, there are not many good substitute words to use.
In our case, we see lots of sites showing in the SERP's that have kw's tightly related to our important kw's *on a purely semantic basis* but that don't have any real relevance to the searcher.
And adding in a lot of those tightly related kw's, the way that G might prefer, would make our homepage confusing at best, if not downright bizarre.
More evidence that if the old game was hitting targets, the new one is overcoming hurdles....
| 6:18 pm on Feb 11, 2004 (gmt 0)|
An interesting related note about city names: One of the historical topics covered by my educational site is also the name of a major American city. Say this page is about the "Martian War" and there's also a city, "Martian, NY."
The indexing is fine: this page shows up for "Martian War" and some related phrases, and not at all for Martian. In our case this is exactly right. Very few people searching for "Martian" would be interested in our site.
However, there are adsense ads on these pages, and I notice that this page is the *only* one for which Google serves the public-service ads (on all of the other pages Google is running ads for history books and other pertinent things). A page about the "History of the Xyzzy War" may have ads for books about military history on it, but the page about the "History of the Martian War" does not. Google gets too confused by the presence of the city word to notice the words "history" or "war" anymore. The prominent presence of the city word seems to really throw Google for a loop as far as deciding what the page is "about."
I suspect that if "Martian War" were a term lots of people were competing over, then that page of ours might well be fluctuating wildly in the SERPs.
Just a thought... As Google refines their algorithm, they may improve the way it handles city names, which could solve a lot of these problems.
| 6:57 pm on Feb 11, 2004 (gmt 0)|
|If we assume that it then disregards the actual term then the sites that will rank well are those with authority terms. |
Why would you assume that?
My experience with different web sites (I have a few wag-o-meter for dogs sites; we have the TOP result in Google) indicates that value is added by having terms that stem from the ontology, but it doesn't disregard or replace that term.
Other wag-o-meter for dogs have come on the market, but we have maintained our number one position not by links (only three sites link to us), but having "dog" and "tail" in the text.
The City, State SERPs are interesting. This might be the area where Google has to sit down and make a decision what they're going to provide and then users can adapt. Right now, what you get depends on what city and state you're talking about. It's not good and this is a major use of Google.
| 6:58 pm on Feb 11, 2004 (gmt 0)|
<sarcasm>it's hard to believe that what this thread is about and what google is doing is the same thing. When you see the sophisticated maths involved and then look at the rubbish being returned, they can't be the same thing, right?</sarcasm>
| 7:01 pm on Feb 11, 2004 (gmt 0)|
Science run amok.
| 7:57 pm on Feb 11, 2004 (gmt 0)|
The LSI paper briefly touches on the possibility that a human librarian might use LSI to work more efficiently. My question is, if Google were to apply LSI, would AdSense provide feedback on relevancy further enhancing the human librarian's efficiency?
What I mean is if you were to apply LSI to every page which had an Adsense script on it, naturally some of the AdSense ads would get higher CTRs. Someone searching for "Who let the dogs out" would be less likely to click on the ad for a Basset Hound Cookies site than he would for an Amazon Baha Men CD page. So would this sort of feedback aid in further refining the serps or in finding the places where a human being's intervention is needed?
| 8:14 pm on Feb 11, 2004 (gmt 0)|
I don't know-- (I don't know enough about it; I'm not the one making advertising decisions for our site--) but it certainly seems like AdSense is providing feedback to US. Seeing how consistently this one page of ours gets served with PSA's, I'm definitely on notice that Google doesn't understand that page. As it's a niche informational topic I'm not willing to mess up the writing of the article over this, but if I were an e-commerce sort of person I'd be busting my tail to get the page to the point where AdSense can figure out what the page is about. Someone was complaining about having to wait weeks to see whether their improvements have helped or not... well, it seems that AdSense provides some immediate feedback of sorts. If the ads are off-topic or Google can't even figure out which ads to put there, it's a potential problem for your site. If you change your site and the ads become on-topic, maybe that means you will do better for Google searches on your topic soon.
Just a thought.
| 8:47 pm on Feb 11, 2004 (gmt 0)|
|Why would you assume that? |
| 9:40 pm on Feb 11, 2004 (gmt 0)|
|Why would you assume that? |
I see your point.
| 9:46 pm on Feb 11, 2004 (gmt 0)|
|What I mean is if you were to apply LSI to every page which had an Adsense script on it, naturally some of the AdSense ads would get higher CTRs. Someone searching for "Who let the dogs out" would be less likely to click on the ad for a Basset Hound Cookies site than he would for an Amazon Baha Men CD page. So would this sort of feedback aid in further refining the serps or in finding the places where a human being's intervention is needed? |
Yeah, this gets back to the Sprinks and niche search point I made. Google just simply can't be everything to everyone, so they're going to be most things to most people.
Sure, they'll still help you find the vendor of bird diapers for your parrot but there will be this huge middle ground (horses, for example) where it's just not working.
Murphy's Law as applied to journalism is this:
Everything you read in the newspaper is absolutely true, except those things of which you have first hand knowledge.
Google's search results are going to run into this law. That is, the results are great, except if you know something about the subject you're searching.
| 9:55 pm on Feb 11, 2004 (gmt 0)|
Many thanks for the LSI link, interesting read.
One problem with this LSI approach seems to be lack of importance given to HTML tags such as TITLE and Hx.
If the keyphrase is weeded out of the analysis on the grounds that all pages in the set contain it, then no weight can be given to having keyphrase in the title.
Sorry, Google, but if fifty pages have similar semantic relevance to my search phrase then I am much, much more likely to be interested in the one with my search phrase in the title and/or H1 tag.
Another problem would be that LSI has no way of telling if a page has genuine content or is simply a random collection of semantically related words. Such a collection could be....hmm, let me think... oh yes, a directory page. Nah, Google would never be daft enough to let one of them appear on page 1 of the SERPS.
All the PhDs in the world can't bestow common sense.