Forum Library, Charter, Moderators: Robert Charlton & aakk9999 & brotherhood of lan & goodroi

Google SEO News and Discussion Forum

This 66 message thread spans 3 pages: 1 2 [3]
Phrase Based Multiple Indexing and Keyword Co-Occurrence
Is this all themes with a new suit of clothes on?

 11:39 pm on May 10, 2007 (gmt 0)

Using a timeline as a starting point: in November 2003 there was a quiet introduction of Google's use of stemming, whereas before they had stated that they didn't use it. That was also the month of the Florida Debacle, the update that shook up the SERPs, and talk about LSI/Latent Semantic Indexing started. In short, according to what I've read, LSI isn't feasible for a large-scale search engine: first off, it's a patented technology, and secondly, it's very resource intensive.

LSI also uses single words - terms. However, 8 months later there was a series of patent applications filed by Google that dealt with Phrase Based Indexing. Phrases, not words. Not all those apps have been published yet, but 6 have been. Six, not five.

An information retrieval system uses phrases to index, retrieve, organize and describe documents. Phrases are identified that predict the presence of other phrases in documents. Documents are then indexed according to their included phrases. Related phrases and phrase extensions are also identified. Phrases in a query are identified and used to retrieve and rank documents. Phrases are also used to cluster documents in the search results, create document descriptions, and eliminate duplicate documents from the search results, and from the index.

In logical sequence:

Phrases are identified:

Phrase identification in an information retrieval system [appft1.uspto.gov]
Application filed: July, 2004
Published: January, 2006

Documents are indexed according to their included phrases:

Phrase Based Indexing in an Information Retrieval System [appft1.uspto.gov]
Application filed: July, 2004
Published: January, 2006

Users search to find sites relevant to what they're looking for:

Phrase-based searching in an information retrieval system [appft1.uspto.gov]
Application filed: July, 2004
Published: February, 2006

The system returns results based on phrases, including functions which include generating the document snippets:

Phrase-based generation of document descriptions [appft1.uspto.gov]
Application filed: July, 2004
Published: January, 2006

Use of a partitioned multiple index system to conserve resources and space:

Multiple index based information retrieval system [appft1.uspto.gov]
Application filed: January, 2005
Published: May, 2006

Phrases are used to detect spam documents:

Detecting spam documents in a phrase based information retrieval system [appft1.uspto.gov]
Application filed: June, 2006
Published: December, 2006

What's interesting about the publication date of that last one is that it attracted widespread attention, and it was only weeks later that an "unofficial" request was put out about reporting paid links. Matt is not only adorable, he's very smart and has an impeccable sense of timing. ;)

There's been plenty of discussion on that aspect of those patent apps, and I'm sure we'd all be delighted if someone wants to start another thread on the topic, including spam, recips, money keywords and Adwords conspiracy theories, but while there have been write-ups about the patents and recaps simplifying what's in the documents, I haven't seen in depth discussion about some of the IR principles that the system embodies. So I think we can all get a better grasp if we discuss those and try to get closer insight into the system.

Keyword Co-Occurrence
For starters, there are repeated references throughout all those documents to the term co-occurrence. In fact, in just a few short one or two word paragraphs in one of them, the word was used ten times. That seems to be the underlying principle that makes the whole system tick.

What it basically means, at the simplest level, is words or phrases that appear together. The patents go into detail about what's looked for, and statistics on co-occurrence patterns are used to relate clusters of terms/phrases into coherent "themes" and make predictions based on those statistics.
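To make "appear together" concrete, here is a minimal sketch that counts how many documents contain each pair of phrases. The sample documents, the phrase list and the function name are made up for illustration, and the plain substring matching is an assumption; nothing here is taken from the patent applications themselves.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(docs, phrases):
    """Count, for each pair of phrases, how many documents contain both."""
    pair_counts = Counter()
    for text in docs:
        lowered = text.lower()
        # Naive substring test; a real system would tokenize properly.
        present = sorted(p for p in phrases if p in lowered)
        for a, b in combinations(present, 2):
            pair_counts[(a, b)] += 1
    return pair_counts


# Hypothetical mini-corpus and phrase list.
docs = [
    "The violin bow needs fresh rosin before the concert.",
    "Rosin the violin bow lightly.",
    "A bow and arrow set for archery practice.",
]
phrases = ["violin bow", "rosin", "bow and arrow", "archery"]
counts = cooccurrence_counts(docs, phrases)
```

Pairs that keep turning up in the same documents (here "rosin" with "violin bow") are the raw material for the kind of statistics described above; pairs that never share a document simply have a count of zero.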

Word Sense Disambiguation
There's always been a problem with contextual relevancy and grasping the intended meaning concerning what a searcher is looking for with words that have several different meanings, or what a document can mean. That's polysemy. Polysemic words are those that are the same but can have more than one meaning. Example: is bow referring to a hair bow, or a bow and arrow or a violin bow?

Really, the only way to be able to tell with ambiguous words would be to look at other words (or phrases) that co-occur - appear with - the word; or in the case of a phrase based system, phrases that co-occur often enough across the whole corpus of documents to be able to discern the meaning or "theme" of a given page when it uses the word.
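As a rough illustration of the bow example, the sketch below guesses a sense by counting which set of cue words overlaps most with the text. The cue word lists and the function name are hypothetical and hand-written here; a system like the one discussed would presumably derive such associations from corpus-wide co-occurrence statistics rather than from a fixed list.

```python
import re

# Hypothetical cue words per sense, hand-picked for this example only.
SENSE_CUES = {
    "hair bow": {"hair", "ribbon", "clip"},
    "bow and arrow": {"arrow", "archery", "quiver", "target"},
    "violin bow": {"violin", "rosin", "strings", "cello"},
}


def guess_sense(text, sense_cues=SENSE_CUES):
    """Return the sense whose cue words overlap most with the text, or None."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    scores = {sense: len(cues & words) for sense, cues in sense_cues.items()}
    best = max(scores, key=scores.get)
    # With no cue words present, the text gives no basis for a guess.
    return best if scores[best] > 0 else None
```

A sentence that mentions "violin" and "strings" would come back as "violin bow", while a sentence containing "bow" with none of the cue words returns None, which is the ambiguous case the paragraph above describes.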

There's a difference between word sense disambiguation and word sense discrimination, and it's explained very well here (clear browser cache, it's a PPT presentation):

Powerpoint Demo on Word Sense Disambiguation [d.umn.edu]

The main difference is that one starts out with a pre-defined lexicon (like LSI). Also, I've got copies of the original Applied Semantics patent and white papers (there were 2), and it seems that was also lexicon based. With phrase-based indexing, it seems that it starts out with a blank slate and creates a taxonomy on the fly based on discovering co-occurrences in order to construct the co-occurrence matrix.

So given that the terminology is used in profusion throughout, it's my feeling that we can benefit by discussing it among ourselves, as well as looking at how the "multiple index" system is set up. Those aspects might well clear up some of the mysteries for us.

Anyone game?



 1:13 am on May 25, 2007 (gmt 0)

So even though we don't know specifics there are some things I am curious about. For starters I'm wondering what sampling Google would be using to sort out the good and bad phrases with all the refinements mentioned in the patents. Or perhaps it wouldn't be a sampling at all but an analysis of all the information they have gathered from the internet.

Where would one get bad phrases from, as they would go on into infinity? So I guess they grabbed their favorite server WP and maybe some other dictionaries. Whatever isn't in them might be bad phrases.

Besides some specific words like SEO, WAREZ, affiliate and so on, I wouldn't know how else you would get a list of words and phrases that would be unrelated.

I think these filters are definitely at work. How else could a Wikipedia snippet rank on page 1 (having passed the external duplicate content filter as it's diluted with boilerplate content), while a completely valid school teacher script with 20 times more info is 950ed? The more detail, the better the text, and the more background info that the simplistic WP doesn't have, the more likely an author is to have a bad phrase. An extensive professional script is more likely to hit a filter than all the encyclopedia entries that I see now in the German SERPs.

Given the tabloid formula of life, any extensive information would also increase your bounce rate, as users want short answers yesterday, and Google does what users want and doesn't seek quality.

That doesn't mean a long article is 100% certain to sink, but your chances definitely increase.


 1:34 am on May 25, 2007 (gmt 0)

patent 0060294155

"In one aspect, good phrases are phrases that tend to occur in more than certain percentage of documents in the document collection, and/or are indicated as having a distinguished appearance in such documents, such as delimited by markup tags or other morphological, format, or grammatical markers. Another aspect of good phrases is that they are predictive of other good phrases, and are not merely sequences of words that appear in the lexicon"

Bad phrases are simply phrases that are not predictive. As far as I can tell, "bad" phrases are ignored. So it doesn't look like bad phrases would be hand picked.
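One way to picture "predictive" is to compare how often two phrases actually share a document with how often they would if they were unrelated. The sketch below does that with a simple observed-to-expected ratio. The ratio form, the 1.5 threshold, the function names and the sample data are all assumptions for illustration; the quoted excerpt only says that good phrases are predictive of other good phrases.

```python
def information_gain(docs_with_a, docs_with_b, total_docs):
    """Ratio of observed to expected co-occurrence for two phrases.

    docs_with_a and docs_with_b are sets of ids of documents containing each phrase.
    """
    if not docs_with_a or not docs_with_b or total_docs <= 0:
        return 0.0
    observed = len(docs_with_a & docs_with_b) / total_docs
    expected = (len(docs_with_a) / total_docs) * (len(docs_with_b) / total_docs)
    return observed / expected


def is_predictive(phrase, postings, total_docs, threshold=1.5):
    """Treat a phrase as 'good' if it predicts at least one other phrase.

    The threshold is a hypothetical value chosen for this example.
    """
    return any(
        information_gain(postings[phrase], docs, total_docs) > threshold
        for other, docs in postings.items()
        if other != phrase
    )


# Hypothetical postings: phrase -> ids of the documents that contain it.
postings = {
    "violin bow": {1, 2, 3},
    "rosin": {1, 2},
    "click here": {1, 2, 3, 4, 5, 6, 7, 8, 9, 10},
}
```

With ten documents in total, "violin bow" and "rosin" co-occur far more often than chance would suggest, so "violin bow" passes. "click here" appears everywhere, so its presence says nothing about any other phrase and it falls out as "bad" without anyone hand picking it.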

What I am curious about is what the original document collection would be.


 8:33 pm on May 26, 2007 (gmt 0)

It tends to sound like the original document collection is the entirety of the crawled/indexed documents at the point when the "lists" start to be constructed that make up the co-occurrence matrix.

I'm still pondering the possible connection to site theming though, because I've seen some sites that do drift off theme, noticeable in the site navigation, and looked at on the whole they include anything even remotely related just to grab eyeballs. Like MFAs for example, that would have in a site for jackets pages on dinner jackets, bed jackets and straitjackets - which definitely aren't on theme, and not even related. They aren't topical sites, they're just keyword dumps.

I'm thinking back to tedster's thread where he mentions engineers intimating that they've been looking at whole sites:

brainstorm: How might Google measure the site, and not just a page? [webmasterworld.com]


 6:18 am on May 27, 2007 (gmt 0)

I can see how Google would want to look at site theming in terms of ranking but actual results seem to be drifting away from it. I'm wondering if phrase based methods have caused this simply because they are not yet refined enough and are giving false positives.

An example I'm seeing in my topic is the sudden change in the top one keyword search results (in singular, plural and -ing form). Over the last few days there has been a major shake-up and now I find the top 10 results almost all have a connection with "free" in the title or in other aspects of the page.

Now I don't imagine that a real person would decide that free would be in any way connected with widgeting or widgets. But automated phrase based data might have come to this conclusion. Free widget patterns are big in my sector and would have come up a lot in pages on this topic.


 8:14 am on Jun 4, 2007 (gmt 0)

>>I can see how Google would want to look at site theming in terms of ranking but actual results seem to be drifting away from it.

I've been looking at a site that seems to have an OOP, and in trying to figure out what's wrong I've come up with two conclusions. One, is that the site is being way over-optimized (to use the term loosely) for the "main" keyword phrase throughout, instead of being properly constructed for the more specific and targeted related phrases. The number of raw occurrences (not density - something else entirely) is way over the top, and the repetition is eliminating occurrences of the relevant related phrases that would normally co-occur with a site of that type, be more thematic and more accurate to indicate the content.

The problem could easily be solved not only by cutting the raw occurrences down to a fraction of their current level, but by fleshing out the site so that there's more variety of appropriate related phrases for each of the pages and sections.
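A quick way to eyeball that kind of imbalance on a page is to count the raw occurrences of the main phrase and list which related phrases show up at all. This is only a sketch for inspecting your own pages: the function name, the sample text and the list of related phrases are hypothetical, and it is not a claim about how any search engine scores pages.

```python
def phrase_profile(page_text, main_phrase, related_phrases):
    """Return raw occurrences of the main phrase and the related phrases found."""
    lowered = page_text.lower()
    raw_count = lowered.count(main_phrase.lower())
    found = [p for p in related_phrases if p.lower() in lowered]
    return raw_count, found


# Hypothetical page text and related phrases.
page = (
    "Blue widgets here. Buy blue widgets. Blue widgets cheap. "
    "Best blue widgets. Blue widgets now."
)
raw_count, found = phrase_profile(
    page, "blue widgets", ["widget repair", "widget sizes"]
)
```

A high raw count next to an empty list of related phrases is the pattern described above: the main phrase repeated over and over, with none of the phrases that would normally co-occur on a page of that type.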

I'm not saying that phrase-based indexing can be optimized for. IMHO it can't be without being able to control the whole index, so it would be like selling snake oil. But the concepts can be used to figure out when certain lines have been crossed.

Whether or not this is being used, I don't think I've ever come across documents that give as much insight into the reasoning that goes into search.


 8:34 am on Jun 4, 2007 (gmt 0)

>>I don't think I've ever come across documents that give as much insight into the reasoning that goes into search.

I'm on the same page, Marcia. And thanks for that post - your points are also very illuminating on why keyword density is becoming pretty much an outdated concept.


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved