
Google SEO News and Discussion Forum

Phrase Based Multiple Indexing and Keyword Co-Occurrence
Is this all themes with a new suit of clothes on?
Marcia




msg:3336437
 11:39 pm on May 10, 2007 (gmt 0)

Using a timeline as a starting point: in November 2003 Google quietly introduced stemming, which they had previously said they didn't use. That was also the month of the Florida Debacle, the update that shook up the SERPs, and talk about LSI (Latent Semantic Indexing) started. In short, from what I've read, LSI isn't feasible for a large-scale search engine: first, it's a patented technology, and second, it's very resource intensive.

LSI also uses single words - terms. However, 8 months later there was a series of patent applications filed by Google that dealt with Phrase Based Indexing. Phrases, not words. Not all those apps have been published yet, but 6 have been. Six, not five.

An information retrieval system uses phrases to index, retrieve, organize and describe documents. Phrases are identified that predict the presence of other phrases in documents. Documents are then indexed according to their included phrases. Related phrases and phrase extensions are also identified. Phrases in a query are identified and used to retrieve and rank documents. Phrases are also used to cluster documents in the search results, create document descriptions, and eliminate duplicate documents from the search results, and from the index.

In logical sequence:

Phrases are identified:

Phrase identification in an information retrieval system [appft1.uspto.gov]
Application filed: July, 2004
Published: January, 2006

Documents are indexed according to their included phrases:

Phrase Based Indexing in an Information Retrieval System [appft1.uspto.gov]
Application filed: July, 2004
Published: January, 2006

Users search to find sites relevant to what they're looking for:

Phrase-based searching in an information retrieval system [appft1.uspto.gov]
Application filed: July, 2004
Published: February, 2006

The system returns results based on phrases, including functions such as generating the document snippets:

Phrase-based generation of document descriptions [appft1.uspto.gov]
Application filed: July, 2004
Published: January, 2006

Use of a partitioned multiple index system to conserve resources and space:

Multiple index based information retrieval system [appft1.uspto.gov]
Application filed: January, 2005
Published: May, 2006

Phrases are used to detect spam documents:

Detecting spam documents in a phrase based information retrieval system [appft1.uspto.gov]
Application filed: June, 2006
Published: December, 2006

What's interesting about the publication date of that last one is that it attracted widespread attention, and it was only weeks later that an "unofficial" request was put out about reporting paid links. Matt is not only adorable, he's very smart and has an impeccable sense of timing. ;)

There's been plenty of discussion on that aspect of those patent apps, and I'm sure we'd all be delighted if someone wants to start another thread on the topic, including spam, recips, money keywords and AdWords conspiracy theories. But while there have been write-ups about the patents and recaps simplifying what's in the documents, I haven't seen in-depth discussion of some of the IR principles the system embodies. So I think we can all get a better grasp if we discuss those and try to get closer insight into the system.

Keyword Co-Occurrence
For starters, there are repeated references throughout all those documents to the term co-occurrence. In fact, in just a few short paragraphs in one of them the word is used ten times. That seems to be the underlying principle that makes the whole system tick.

What it basically means, at the simplest level, is words or phrases that appear together. The patents go into detail about what's looked for, and statistics on co-occurrence patterns are used to relate clusters of terms/phrases into coherent "themes" and make predictions based on those statistics.
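To make the idea concrete, here's a rough Python sketch of co-occurrence counting (purely my own toy illustration of the statistical idea, not anything taken from the patents):

from collections import Counter
from itertools import combinations

# Toy corpus: each "document" is just a list of phrases already extracted from it.
docs = [
    ["white house", "presidential politics", "press briefing"],
    ["white house", "presidential politics", "election campaign"],
    ["violin bow", "sheet music", "string quartet"],
]

pair_counts = Counter()
phrase_counts = Counter()

for doc in docs:
    unique = set(doc)
    phrase_counts.update(unique)
    # Count every unordered pair of phrases that co-occur in the same document.
    for a, b in combinations(sorted(unique), 2):
        pair_counts[(a, b)] += 1

# Phrases that co-occur more often than chance would predict hint at a shared "theme".
for (a, b), n in pair_counts.most_common(3):
    expected = phrase_counts[a] * phrase_counts[b] / len(docs)
    print(f"{a!r} + {b!r}: seen together {n}x, expected ~{expected:.2f}")

The real system obviously works on billions of documents with much more refined statistics; this just shows what "co-occurrence data" looks like at the smallest possible scale.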

Word Sense Disambiguation
There's always been a problem with contextual relevancy: grasping the intended meaning of what a searcher is looking for, or of what a document means, when words have several different meanings. That's polysemy. Polysemic words are spelled the same but can have more than one meaning. Example: is bow referring to a hair bow, or a bow and arrow, or a violin bow?

Really, the only way to be able to tell with ambiguous words is to look at other words (or phrases) that co-occur with - appear alongside - the word; or, in the case of a phrase based system, phrases that co-occur often enough across the whole corpus of documents to be able to discern the meaning or "theme" of a given page when it uses the word.
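A crude sketch of how co-occurring context words could pick the sense of an ambiguous word like bow (the sense "profiles" here are made up purely for illustration):

# Hypothetical sense profiles: words that tend to co-occur with each sense of "bow".
SENSES = {
    "bow (weapon)": {"arrow", "archery", "quiver", "target"},
    "bow (music)":  {"violin", "strings", "rosin", "cello"},
    "bow (hair)":   {"ribbon", "hair", "dress", "gift"},
}

def disambiguate(context_words):
    # Pick the sense whose profile overlaps most with the words surrounding "bow".
    context = {w.lower() for w in context_words}
    return max(SENSES, key=lambda sense: len(SENSES[sense] & context))

print(disambiguate("she tightened the horsehair on her violin bow before the concert".split()))
# -> bow (music)

In a phrase based system the profiles wouldn't be hand-built like this; they'd come out of the co-occurrence statistics themselves.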

There's a difference between word sense disambiguation and word sense discrimination, and it's explained very well here (clear browser cache, it's a PPT presentation):

Powerpoint Demo on Word Sense Disambiguation [d.umn.edu]

The main difference is that one starts out with a pre-defined lexicon (like LSI). Also, I've got copies of the original Applied Semantics patent and white papers (there were 2), and it seems that was also lexicon based. With phrase-based indexing, it seems that it starts out with a blank slate and creates a taxonomy on the fly, based on discovering co-occurrences, in order to construct the co-occurrence matrix.

So given that the terminology is used in profusion throughout, it's my feeling that we can benefit by discussing it among ourselves, as well as looking at how the "multiple index" system is set up. Those aspects might well clear up some of the mysteries for us.

Anyone game?

 

Marcia




msg:3339179
 2:13 pm on May 14, 2007 (gmt 0)

This sounds like (if I understand it…) a guide to delivering content that fits G's indexing patterns. Are we back to forgetting to deliver a site that aims at users, and aiming at pleasing almighty G instead?

Exactly the opposite. It's a chicken and egg thing.

The phrase lists are based on content that's already on pages that are already out there. First the content that's on pages, then the crawling, then the indexing, then the possible and good phrase lists, then the queries and re-ranking.

Those patents aren't about optimization at all, not even close. It isn't an SEO topic, it's an Information Retrieval topic; there's a difference between the two.

[Just like the good old days when we sat around the campfire at WebmasterWorld and talked about Term Vector Databases.]

[edited by: Marcia at 2:29 pm (utc) on May 14, 2007]

justageek




msg:3339186
 2:21 pm on May 14, 2007 (gmt 0)

How do you expand the phrases and get the co-occurrence data for those individual terms?

For what I was doing, I was seeding my searches based off whatever web page I was analyzing. For example, I'd spider a web page and break it down to the words in the original order.

I'd then group the words into sets as big as I wanted, again keeping them in order. Those sets of words then became my lexicon for the moment.

Going in order expanded my phrases, and since most pages are written by a human it worked very well.

On pages not written by a human, or poorly written pages, the co-occurrence falls off drastically as the phrase gets longer, and I would score them much lower than others. I'm guessing those are the pages Google considers spam? I guess I did as well, which is why they'd be thrown out by my algo.
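Roughly, in code, that grouping-and-checking step looks something like this (a simplified sketch of the general approach, not my actual tooling):

import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Toy stand-in for the other pages/indexes used as a reference corpus.
reference_pages = [
    "technical specifications and data sheet for the new widget",
    "download the widget data sheet with full technical specifications",
    "widget technical specifications data sheet and manual",
]
seen = Counter(g for p in reference_pages
                 for n in range(1, 5)
                 for g in ngrams(tokenize(p), n))

# Page under analysis: well-written pages keep matching at longer phrase lengths,
# while junk pages fall off fast as the phrases get longer.
page_tokens = tokenize("technical specifications data sheet for the widget")
for n in range(1, 5):
    grams = ngrams(page_tokens, n)
    hits = sum(seen[g] > 0 for g in grams)
    print(f"{n}-word phrases also found in the reference set: {hits}/{len(grams)}")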

JAG

justageek




msg:3339189
 2:24 pm on May 14, 2007 (gmt 0)

The phrase lists are based on content that's already on pages that are already out there. First the content that's on pages, then the crawling, then the indexing, then the possible and good phrase lists, then the queries and re-ranking.

That's exactly it! It works wonders AND works on nearly any language...which makes me wonder how much they use it, considering they only support a couple dozen languages for AdSense. I was able to analyze hundreds of languages. Maybe they only support those few languages for financial reasons. Surely it can't be technical, can it?

JAG

Marcia




msg:3339238
 3:27 pm on May 14, 2007 (gmt 0)

>>For example, I'd spider a web page and break it down to the words in the original order.

This is different in both method and scope, though. They're not using individual words, and they aren't using web pages - they're using co-occurrence of phrases (phrases that appear together, and related phrases) across the entire document collection, which is billions of pages.

By analyzing a page you can see what words are on the particular page, but with this system, by using statistical data on the whole collection of documents they can identify pages as relevant that don't have the exact phrase (or words) but do contain related phrases which are identified via the statistical co-occurrence data.

Matching pages with words can't give what second-order co-occurrence of phrases can, because the latter can identify phrases that are related even though they never appear together directly - they only share co-occurring phrases a hop or two back, with enough frequency.
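A toy sketch of what second-order co-occurrence means (two phrases that never appear together, but share co-occurring "neighbors"), just to illustrate the principle:

from collections import defaultdict

# Which phrases appear in which (toy) documents.
docs = {
    1: {"technical specifications", "data sheet", "dimensions"},
    2: {"spec sheets", "data sheet", "dimensions"},
    3: {"technical specifications", "tolerances"},
}

neighbors = defaultdict(set)
for phrases in docs.values():
    for p in phrases:
        neighbors[p] |= phrases - {p}

a, b = "technical specifications", "spec sheets"
# They never co-occur directly, but they share neighbors: a second-order relation.
print("Directly co-occur:", any(a in d and b in d for d in docs.values()))
print("Shared neighbors:", neighbors[a] & neighbors[b])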

However - what you've done with examining individual pages can sometimes help to give a clue. If you look at related phrases and words appearing on pages in the top results for unusual, niche-type phrases, you can see that pages that aren't ranking often don't have the supporting phrases on the page to indicate they're relevant for the query, while the ones that are ranking do.

For example, going back to the technical specifications sheet, checking on those searches you can tell that the word sheet doesn't seem to relate to technical specifications. Now, sometimes they're referred to as spec sheets in the more informal vernacular, but apparently not in the published web pages, which would be more formal. So it isn't a good phrase that's recognizable for technical topics.

However, if enough documents were to include technical specifications AND spec sheets on the page, then a connection could be made based on co-occurrence of the phrases.

It's interesting to see the difference between a search for the word sheets with the tilde (query expansion) and without, if you try it both ways, and also to do searches on technical specifications and "technical specifications" in quotes.

[edited by: Marcia at 3:48 pm (utc) on May 14, 2007]

justageek




msg:3339320
 4:38 pm on May 14, 2007 (gmt 0)

This is different in both method and scope, though. They're not using individual words, and they aren't using web pages - they're using co-occurrence of phrases (phrases that appear together - and related phrases) throughout the entire document collection altogether, which is billions of pages.

Ahh...but both methods are the same when you look at what Google has in the patents. We do differ on scope, however, because I used more than just one index (Google, MSN and Yahoo!). What they've done is collect all the words and phrases from their index - gathered from web pages by their crawlers - into a collection of phrases to play with. Either way you look at it, we both refer to billions of documents to make groups and decisions. I just chose to use several SERP indexes instead of storing one locally, since they all have way bigger machines than I do. And I would discount single words to nearly zero for any kind of scoring. They obviously have some value, so they cannot be discounted completely.

However, if enough documents were to include technical specifications AND spec sheets on the page, then a connection could be made based on co-occurrence of the phrases.

This is absolutely true...now you know how I found out how to get your IP banned if you smack the engines too hard and fast! I also had to change how I did things, because brute force was slow. I did end up changing to a more methodical way of relating seemingly unrelated documents, which sped up the process and made it even more reliable.

But there was a downside to relating documents in general. You have to know when to stop! Stupid me forgot that. The first time I started looking at the relationships I was amazed. I then pushed the limits on how far you can go and realized there is a point where documents are still related, but the relationship is so distant that I couldn't use it anymore. Getting the distance between documents without calling the relationship too close (100% related) or not close enough (100% unrelated) drove me nuts!

I gave a shortened version of the entire process so as not to give away everything I did. I'm just saying that what Google has in their patents is a good start for them, and I can confirm from real-life applications I built that the process they describe does indeed work.

What I don't know is how to make it work to get people better ranking :-/ Not yet anyway.

JAG

Miamacs




msg:3339398
 6:20 pm on May 14, 2007 (gmt 0)

You know I'd be the happiest person if Google was able to do this starting with a "blank record" and gather data on the fly.

But even if they had started it years ago, what data would they see?

Have you thought about that?

It's not only a matter of being an authority or not, and how Google would weigh one source against another, though that too would be interesting. Nine tenths of the internet is spam documents using AdWords / AdSense / marketing-sensitive language. And the very same language.

The same goes for most of the websites that survived all the updates. Most of the big players have long since abandoned not only their graphics-intense design and love of dynamic looks, Flash and image links, but also their own wording. If you wanted to stay on the net, you HAD to switch to the net's language to be found and used. And on the other hand, if a website is NOT accessible for analysis - like most of the best-written content - it could hold the best definitions for every English phrase and it still wouldn't change a thing.

...

If they're analyzing whatever their bots don't drop on the spot, what do you think the data they gather would look like at this point? Most - not all, but most - of the sites are already using very net-like, very closed-circuit semantics.

I can't imagine how they would determine whether a certain wording was intended to mislead their algo along the borderlines, or to brand something through sheer volume and burn another synapse into the brain of their AI ( Amazon means... Google means... ). Right now... whether you're speaking of volume, or of sites being referred to a lot of times... the data is long since corrupted.

If they started with a blank record, they wouldn't be scanning websites they reach to build the Google language.

They'd be scanning books instead, documents that have never been optimized to communicate with their robot.

...

Oh wait.
That's what they're doing.

I'd insert my first ever smiley, but don't know what the tinfoil hat looks like in ASCII ...

[edited by: Miamacs at 6:26 pm (utc) on May 14, 2007]

annej




msg:3339611
 11:48 pm on May 14, 2007 (gmt 0)

They're not using individual words

Please clarify this. My impression was that, by the definition in the phrase based patents, a single word could count as a phrase. But then I'm not used to reading patents.

Maybe Google's objective is to make the cost of optimizing for Google higher than the cost of simply writing or buying good, useful organic content

You have a good point there in terms of the spam patent, but it is just one of several phrase based patents, just one possible application. What is intended in this thread is to look at the possibility that Google may be setting up their entire theming structure using phrase based technology. In other words, are we getting tunnel vision when we just look at this in terms of penalties and filters? I know I was.

They'd be scanning books instead, documents that have never been optimized to communicate with their robot.

I think you have a good point there. It would sure help those of us in academic topics if this were true. I doubt it is though.

I'm thinking this will take time for Google to refine. I'd assume they would always be looking at results and adjusting. Maybe they will eventually see a need to do something like you suggest.

callivert




msg:3339682
 2:01 am on May 15, 2007 (gmt 0)

Some questions that have been raised:

Niche topics
"One concern that I have is that some very niche topics will be so unique that there won't be sufficient data to find the validating co-occurring phrases."
LSA is supposed to solve this problem. In fact, it was invented specifically to solve the "sparse data" problem. That's what it does. It doesn't need a lot of data for any particular word or phrase to be able to put it into the semantic space. As long as the entire dataset is really big, rare words and phrases can be located easily.

Validity of the data
As for using web pages versus books and the validity of the data, this is a relatively minor problem. Google can tell the difference between a kick-ass LSA space and a bad LSA space, and building one is not very difficult. It's been done many times by many different groups.

Cost
As for the technology being patented, well, Google has lots of money.

Words versus phrases
With semantic spaces, words count as phrases. It is just as easy to find a document that's similar to a five word (or ten word) phrase as to a single word. And that's true of any arbitrary phrase, even if the phrase itself never occurs in any of the documents.

Computational cost
Yes, it's computationally costly. However, they should be able to shift most of the cost to the back end, i.e. the indexing of pages, rather than the retrieval of pages. If anything, it should be faster to retrieve documents if you're just using a vector of 200 numbers to represent every document.
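For anyone curious, here's a minimal sketch of what building a "semantic space" with truncated SVD looks like (assuming numpy, a toy term-document matrix, and nothing remotely like production scale):

import numpy as np

# Toy term-document matrix: rows = terms (house, president, violin, bow), columns = documents.
A = np.array([
    [2, 1, 0],   # house
    [1, 2, 0],   # president
    [0, 0, 2],   # violin
    [1, 0, 1],   # bow
], dtype=float)

# Truncated SVD: keep k dimensions (real LSA would keep a few hundred on a big corpus).
k = 2
U, S, Vt = np.linalg.svd(A, full_matrices=False)
doc_vectors = (np.diag(S[:k]) @ Vt[:k]).T   # one k-dimensional vector per document

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Expect the two politics-ish documents to sit closer together than politics vs music.
print(cosine(doc_vectors[0], doc_vectors[1]))
print(cosine(doc_vectors[0], doc_vectors[2]))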

Marcia




msg:3339696
 2:40 am on May 15, 2007 (gmt 0)

>>clarify

annej, what the OP is doing is a completely different thing altogether. And the whole gist is stated in the first sentence that says the search engine uses phrases to index, retrieve, etc. Comparison is made between indexing based on words and using phrases to index by concepts.

>purpose

Actually it's an IR thing rather than an SEO thing, and probably the most important benefit mentioned, aside from relevancy issues, is the multiplied increase in index capacity.

tedster




msg:3339705
 3:00 am on May 15, 2007 (gmt 0)

As I understand it, annej, you're right that the patents allow for a 1-word "phrase" in their methodology. For example:

The first word in the window 302 is candidate phrase i, and each of the sequences i+1, i+2, i+3, i+4, and i+5 is likewise a candidate phrase. Thus, in this example, the candidate phrases are: "stock", "stock dogs", "stock dogs for", "stock dogs for the", "stock dogs for the Basque", and "stock dogs for the Basque shepherds".

Phrase Identification patent at the USPTO [webmasterworld.com]

However, the earliest sections of the patent also make clear that the most significant usefulness of this method comes from longer phrases built on each of those single words.
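Here's a small sketch of that sliding window, just reproducing the patent's example (my paraphrase of the candidate-phrase step, not the patent's actual code):

def candidate_phrases(words, window=6):
    """Each position starts a candidate; extend it word by word up to the window size."""
    for i in range(len(words)):
        for n in range(1, min(window, len(words) - i) + 1):
            yield " ".join(words[i:i + n])

text = "stock dogs for the Basque shepherds"
print(list(candidate_phrases(text.split()))[:6])
# ['stock', 'stock dogs', 'stock dogs for', 'stock dogs for the',
#  'stock dogs for the Basque', 'stock dogs for the Basque shepherds']

The interesting part is what happens afterwards: only the candidates that turn out to be statistically "good" (predictive of other phrases) survive into the index.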

rcain




msg:3339708
 3:12 am on May 15, 2007 (gmt 0)

Hi
Firstly thanks for a really interesting thread.

Secondly, please forgive me if my thoughts here seem rather simplistic; although my background is in Cybernetics, I'm new to this SE thread and haven't yet read through all the relevant patent applications which may or may not be manifest within our friend Google. However, the subject matter under discussion here suggests to me the value of reconsidering some first principles and how they might be most easily/computably implemented in practice, viz:

- Google's aim is to make searches as well as search results more 'Meaningful' to 'users' (putting aside for a while the spectre of PPC)

- Phrases support more 'meaning' than single words - they must therefore represent an ultimate strategic mechanism for Google to perfect.

- central to 'meaning' is 'context' - context of the search-phrase and context of the found-phrase - this doesn't have to be terribly sophisticated in order to be 'useful' - a simple 'subject-container' type taxonomy can work pretty well (eg. subjects over phrase in para in page in site) - perhaps 'short-term memory' would be the next most useful thing to model where it isn't already implicit.

- it would seem logical/efficient to extend or 'layer' hashed inverted indexing techniques used in 'word' look-up to cover 'phrases' (strings with spaces in) in order to approximate contextual structure (a sketch of this idea follows below).

- statistical/Bayesian pattern matching algorithms would likely be used in conjunction with thesaurus/dictionary layers - the former being particularly useful for non-text based data and non-native language data (eg. images and composite pages - already used very successfully in certain email spam filters); the latter being implemented as a set of cached self-joins on existing index terms where native language lexicons are available.

- one man's spam is another man's gold - spam is not a type of data but a reduction of 'variety' - ie. low-meaning replication: ergo, to be fair (to providers) & useful (to consumers), SERPs may need to redefine the phrase 'similar results...' to mean eg. 'other special offers for Product X differing in URL ONLY...'

Anyway, just some thoughts. Conversational Google anyone?
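To show what I mean by 'layering' the inverted index with phrases, here's a bare-bones sketch (a hypothetical structure, just to make the idea concrete):

from collections import defaultdict

# A word-level inverted index, with an extra "layer" keyed by whole phrases.
word_index = defaultdict(set)    # word   -> set of doc ids
phrase_index = defaultdict(set)  # phrase -> set of doc ids

docs = {
    1: "small business server 2003 setup guide",
    2: "server room cooling for small offices",
}

def index_doc(doc_id, text, max_phrase_len=3):
    words = text.split()
    for i, w in enumerate(words):
        word_index[w].add(doc_id)
        for n in range(2, max_phrase_len + 1):
            if i + n <= len(words):
                phrase_index[" ".join(words[i:i + n])].add(doc_id)

for doc_id, text in docs.items():
    index_doc(doc_id, text)

# A phrase lookup is a single hash probe, just like a word lookup.
print(phrase_index["small business server"])   # {1}
print(word_index["server"])                    # {1, 2}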

tedster




msg:3339776
 4:48 am on May 15, 2007 (gmt 0)

central to 'meaning' is 'context'

This is addressed by co-occurrence of phrases that are predictive of the actual terms being searched, I think.

annej




msg:3339787
 5:32 am on May 15, 2007 (gmt 0)

most significant usefulness of this method comes from longer phrases built on each of those single words.

That was my understanding. The method is phrase based but words are looked at in relation to phrases.

mattg3




msg:3339841
 7:08 am on May 15, 2007 (gmt 0)

Sure, but will the bloomy lyrics or prose really represent a tangible topic?

Why not? Maybe not in a technical spec sheet, but humans are so much more advanced than a phrase analysis. So much humour is based on exaggeration, and in satire there is information: that the satirised topic is being criticised.

statistical/Bayesian pattern matching algorithms

The satirical posterior is definitely a funny concept ...

mattg3




msg:3339847
 7:25 am on May 15, 2007 (gmt 0)

They'd be scanning books instead, documents that have never been optimized to communicate with their robot.

So is it time, then, to go to your library, find a public domain book on your topic and scan it in? It seems so ... OCR is getting popular again. :) And since it's in the public domain, use the odd sentence in your texts. No direct match is best ... since them sneaky algo people might detect exact matches. Creative writing at its best. At the moment they might use WP.

Marcia




msg:3340303
 6:35 pm on May 15, 2007 (gmt 0)

"One concern that I have is that some very niche topics will be so unique that there won't be sufficient data to find the validating co-occurring phrases."

LSA is supposed to solve this problem. In fact, it was invented specifically to solve the "sparse data" problem. That's what it does. It doesn't need a lot of data for any particular word or phrase to be able to put it into the semantic space.

But this isn't LSA/LSI which uses Singular Value Decomposition. This is a completely different type of process being described. This isn't using SVD.

Unfortunately LSI has turned into a "generic" word that's being used for anything that even comes close to using semantic factors, but it isn't so; it's a distinct process and isn't generic for everything else. LSI is LSI and other things are those other things, they aren't LSI.

words versus phrases.
With semantic spaces, words count as phrases.

The patents explain the difference between a Boolean/term-based system and the phrase-based system, which is pretty clearly indicated in the parts where they're discussing the process for expansion of phrases and incomplete phrases like "President of the." It was very clearly stated how inadequate individual words are as the basis of a system. Individual words "may" be used as phrases, but that's only a side issue and a minor point.

yes, it's computationally costly. However, they should be able to shift most of the cost to the back end, ie the indexing of pages, rather than the retrieval of pages. If anything, it should be faster to retrieve documents if you're just using a vector of 200 numbers to represent every document.

That's the opposite of the benefits the patents point out, which are that resources are conserved and space requirements are drastically reduced.

LSI has scalability issues - this system is designed to scale. The only thing in common is the use of a couple of IR terms that are used across the board in IR, and we'll get off track if we don't distinguish them. This isn't LSI.

Marcia




msg:3340446
 9:36 pm on May 15, 2007 (gmt 0)

Telcordia LSI Engine: Implementation and Scalability Issues [citeseer.ist.psu.edu](PDF)

If there's any element that's important for Google it's scalability - and effective use of storage space.

annej




msg:3340453
 9:43 pm on May 15, 2007 (gmt 0)

This is a completely different type of process being described.

That's what I thought and why I wondered if this new process is adequately covering niche topics. I imagine it will eventually as I'm sure it is a work in progress, but I'm not so sure if it is there right now.

rcain




msg:3340616
 2:14 am on May 16, 2007 (gmt 0)

Hi Marcia
Instead of conjecturing, I finally got the TIFF viewer working on the US patent office site and took a look at:

Anna Lynn Patterson (Fenwick & West), Jan 2006, San Jose CA US
Phrase Based Indexing in an Information Retrieval System:

- and what I'm assuming is the start of another conjecture, that Google has implemented or is about to implement some near version of it - is there ever likely to be any way to be sure?

From a scan of the doc, my understanding is:

---------------------------
'good' phrases occur frequently THEN AND ONLY THEN predict other (good phrases), 'bad' phrases don't (== information gain)

'phrase length' >=2 words, nominally 4 to 5 words (by example)

'phrase' includes all words in (parsing) window - including more usual stop words (eg. the, and,..)

'an interesting phrase' is a 'good' phrase with additional markup

'clusters' are 'good phrases' related by a 'gain' threshold

'cluster naming taxonomy' <== 'name' of the 'related phrase' in the 'cluster' with highest 'gain'

related phrase vectors between documents (marked-up anchor
phrases to body phrases) <=> inbound & outbound scores

'se bomb prevention' by using inter-document phrase gain (rather than linear links frequency) - ie. se now looks for interpage 'relevance'

search ranking according to phrase intersects, related phrase bit vectors and clusters - docs containing most nbr of related phrases of query phrase get highest ranking

clusters further used to favor strongly 'on topic' result sets

(user) phrase 'history' can be used to 'focus' navigation

possible taxonomy generation for presentation or search results

better identification and filtering of (near) duplicate pages
---------------------------------
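For what it's worth, here's a toy version of that 'good phrase' / information-gain test as I read the summary above (my own sketch, not the patent's actual math):

from collections import Counter
from itertools import permutations

# Toy corpus: the phrases found in each document.
docs = [
    {"australian shepherd", "stock dogs", "herding"},
    {"australian shepherd", "stock dogs", "basque shepherds"},
    {"stock dogs", "herding", "sheep"},
    {"click here", "stock dogs"},
    {"click here", "privacy policy"},
]

phrase_freq = Counter(p for d in docs for p in d)
pair_freq = Counter(pair for d in docs for pair in permutations(d, 2))

def info_gain(a, b):
    """How much more often b actually co-occurs with a than chance alone would predict."""
    expected = phrase_freq[a] * (phrase_freq[b] / len(docs))
    return pair_freq[(a, b)] / expected if expected else 0.0

GAIN_THRESHOLD = 1.2   # arbitrary toy threshold
for a, b in [("stock dogs", "herding"), ("click here", "herding")]:
    g = info_gain(a, b)
    verdict = "predictive" if g > GAIN_THRESHOLD else "not predictive"
    print(f"{a!r} -> {b!r}: gain {g:.2f} ({verdict})")

In this toy data 'stock dogs' predicts 'herding' well above chance, while 'click here' doesn't, which is roughly the good-phrase / bad-phrase split described above.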

.....the implementation looks a little 'clunky' (though effective) and, as you suggest, computationally heavy, though I would have thought that with Google's annual turnover in excess of US$8bn, disk, RAM and MIPS would be a comparatively cheap resource to throw at such a project.

....I'm still pondering the impact on site structure, content design and SEM for my customers, and I'd be very interested to see some experimental/user comparisons/benchmarks against, say, bio-medical data vs satirical poetry - posterior or otherwise.

....I wonder if we'll end up with so many 'user parameters' to tune that we'll forget what it was we were searching for in the first place, and miss all the fun of the red herrings on the way?

Pass the Dutchie




msg:3340809
 10:24 am on May 16, 2007 (gmt 0)

A search for 'sbs 2003' returns results that highlight 'Small Business Server 2003'. Google also seems to show a preference for results about 'Small Business Server 2003'. Clever, but what if I don't want results for 'Small Business Server 2003' but for the term I actually searched for?

Any theories as to how G determines how to rank results for these two terms? Popularity of the search term? Click-through?

Miamacs




msg:3340915
 1:08 pm on May 16, 2007 (gmt 0)

allinanchor:sbs 2003

...

Also, check the AdWords suggestions.
You don't have to, I already have. Yes, you guessed it right.
Google knows about the relation from many sources.

Actually it's the most common relation.

sbs 2003 -small -servers -business -server -microsoft -windows -win

...was the shortest query that filtered them out for me.
Which is funny. If you enter -server, the results will still be happy to show "servers".

Silvery




msg:3341299
 7:10 pm on May 16, 2007 (gmt 0)

It seems likely to me that some variety of categorical assignment must be in use at Google at the moment, regardless of how expensive the semantic analysis could be.

I believe it highly likely that Google must be associating websites with category topics or "concepts" on a site-wide basis, instead of on a page-to-page basis. Identification of site-wide themes would be far, far more efficient than identification on a page-by-page basis.

Marcia, would this satisfy the objection that LSI would simply be too process-expensive to be in operation?

In Matt Cutts' recently-updated post on "how to report paid links":
[mattcutts.com...]

He provides an example of ads appearing on a Linux-subject site -- the links promote casinos, jewelry, gambling and ephedra. In reference to these, he states:

Our existing algorithms had already discounted these links without any people involved.

Methods to automatically identify such linkings would likely include comparisons on whether the sites are semantically related or not.

There are a number of methods that they could be using for categorizing websites according to their concepts, of course, but one of the signals that I believe is being used is analysis of frequency of terms found within the site itself, and those terms' association with the concepts. (naturally, they would likely ignore the terms found in text of links pointing to external sites for the purposes of this analysis)

One piece of evidence that might support my theory is that Google is apparently tracking the top words found throughout a particular site, and they report those top terms in the reports found via Webmaster Tools.

Why else would they track site-wide terms if not to use that information for some sort of analysis to help identify the themes/concepts of each site?
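Here's a rough sketch of the kind of site-wide term tally being described (purely illustrative; nobody outside Google knows how the Webmaster Tools numbers are actually computed):

import re
from collections import Counter

# Toy "site": the text of a few crawled pages from one domain.
site_pages = [
    "linux kernel tutorial covering modules and drivers",
    "how to compile the linux kernel from source",
    "linux shell scripting basics for beginners",
]

STOPWORDS = {"the", "and", "to", "for", "from", "how", "of", "a"}

counts = Counter(
    w for page in site_pages
    for w in re.findall(r"[a-z]+", page.lower())
    if w not in STOPWORDS
)

# The top site-wide terms hint at the site's overall theme.
print(counts.most_common(5))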

Marcia




msg:3341301
 7:10 pm on May 16, 2007 (gmt 0)

There's little difference now between singular and plural, but that's simple, common sense stemming and a good thing so people don't have to page-stuff sites and create different pages to optimize for the plural and singular versions of a word.

But a stemmed word that's the same isn't the same thing as related phrases, and neither are synonyms.

tedster




msg:3344967
 12:49 am on May 21, 2007 (gmt 0)

I have had some success for two clients in getting pages to rank better on Google by broadening the variety of semantically related phrases on the page. I'm not talking about other phrases built on the same root words or just the use of different stemming, but rather, totally different words and phrases, altogether new character strings that are related in meaning and likely to co-occur; phrases like "White House" and "presidential politics".

This approach is in direct opposition to the "old way" of writing content -- trying for high keyword density and so on. It also is not the approach most people take when they feel they are suffering from a phrase based spam penalty (the minus 950 penalty). Many times they work to lower co-occurrence.

I can't claim any big global insights yet, especially because my sample size is so small. But what I have seen is very suggestive.

Marcia




msg:3344978
 1:04 am on May 21, 2007 (gmt 0)

>related

I've seen a case where a page popped out in a few days after a backlink was added, BUT it was a measured test. Not only was a backlink added; there were also many relevant, on-topic related phrases already on both the site and the page carrying that outlink (that site and page had both been there for a while), and a couple of phrases in close proximity to the new outlink were related to the "word" in the problem phrase that was causing the problem.

And there are related phrases that "co-occur" in sites linked out to and inbound links for the site that are right on target for the topic.

Related phrases are not the same as synonyms, and extended phrases can present a different context as far as the potential for information gain is concerned.

Another point: from what I've read, it's my understanding that the ~tilde operator isn't for synonyms exclusively, as it's often been interpreted; technically it's for query expansion, and there's a difference there also.

Added:

Please feel free to disagree and show why it's incorrect, it would be welcomed on my part; but IMHO the primary factor in the whole phrase-based approach is keyword co-occurrence. It's just mentioned too many times in too many contexts within all of those papers not to be. And that's the reason we need to delve into and try to grasp what these patents are indicating.

[edited by: Marcia at 1:24 am (utc) on May 21, 2007]

annej




msg:3348128
 1:20 am on May 24, 2007 (gmt 0)

Well, I finally finished the two big projects that have been keeping me from reading patents. Today they are done, and all that was left was to clean my neglected house or read patents. Well, given that choice, what would anyone do? ;)

I am just beginning to form some impressions, and I certainly don't completely understand all I read in them. But I want to see if I'm on the right track, so here is what they seem to be indicating.

Through an involved process they would sort phrases into good phrases and bad phrases. Good phrases are those that are predictive of other good phrases, while bad phrases are not useful in terms of prediction. For example, bad phrases would include very infrequent phrases and common sayings that could appear in a document on many topics, like "that's the way it goes", among other things.

There are many refinements in this, but in the end the process would determine whether a document included more than one phrase that would predictably be seen together in a given topic. That would be matched with the query phrase (not word for word, but more on a conceptual level).

Certain phrases in a document would be given more weight, including phrases that are bolded, in quotes, and in anchor text. Both inbound and outbound anchor text would be considered.

Somehow, with this mix of information, documents would be ranked in relation to the query.

This process would actually take less computer space etc than the traditional way of considering every word on a page. Instead only qualified phrases would be considered.
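To picture the ranking-with-weights part, here's a toy scoring sketch (the phrase list and weights are invented for illustration; the patents don't give actual numbers):

# Toy scoring: a document earns points for each related phrase it contains,
# with extra weight when the phrase appears in bold or in anchor text.
RELATED = {"phrase based indexing", "information retrieval", "co-occurrence"}
WEIGHTS = {"body": 1.0, "bold": 1.5, "anchor": 2.0}

def score(doc):
    """doc maps each phrase on the page to the places where it appears."""
    return sum(WEIGHTS[place]
               for phrase, places in doc.items() if phrase in RELATED
               for place in places)

doc_a = {"phrase based indexing": ["body", "bold"], "co-occurrence": ["anchor"]}
doc_b = {"information retrieval": ["body"]}
print(score(doc_a), score(doc_b))   # doc_a outranks doc_b for this topic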

I have questions and thoughts on all this but for starters I need to know if I'm even on the right track on how this would work. Any thoughts or clarifications anyone?

tedster




msg:3348270
 4:49 am on May 24, 2007 (gmt 0)

That sounds like the way I understand it, annej.

My main question right now is whether this approach would need supplementary support for idioms and bigrams (two words in text but one unit of meaning). My guess is that with a large sample, co-occurrence of good phrases alone could theoretically carry the day, but I suspect that would be too academic and pure for the real world.

I'd love to get a look at this kind of phrase data - even just a sampling!

mattg3




msg:3348665
 1:54 pm on May 24, 2007 (gmt 0)

I have questions and thoughts on all this but for starters I need to know if I'm even on the right track on how this would work. Any thoughts or clarifications anyone?

Data needs models to be efficient. It's all fine knowing the data (aka the information they give in the patent) but without the actual model attached to it, you can read all the patents in the world and you won't understand what is actually happening. If you really want to read, I guess, you need to read AI and mathematical papers and watch the techtalks. But is it worth the time investment?

But if you're up for it, you can of course get deep into a corpus-based approach for building semantic lexicons, or relational learning of pattern-match rules for information extraction. :)

But if you read all that you still don't know which weights they give to what and which one they use and how efficient it really is.

mattg3




msg:3348699
 2:09 pm on May 24, 2007 (gmt 0)

I'd love to get a look at this kind of phrase data - even just a sampling!

A better approach, given the increasing complexity of their algorithm, would be to build a Bayesian model based on the SERPs.

An ad hoc, simplistic approach might be to use something like SpamAssassin, train it on the SERPs that are top or not, and then run the pages you're trying to SEO through it. It might not match the Google algorithm, but it will tell you what passes this test.

Rich SEOs might employ mathematicians or computer scientists to engineer something that identifies successful from unsuccessful pages.

But I think the read-a-patent-and-know-what's-going-on days are over. :)

If I did SEO, that's what I'd do. If you can't get the key to the lock, pick it with something else.
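For the curious, a bare-bones sketch of that idea: a tiny naive Bayes word model trained on "top" pages vs the rest (all data here is made up; it's an illustration of the approach, nothing more):

import math
import re
from collections import Counter

def words(text):
    return re.findall(r"[a-z]+", text.lower())

# Toy training data: pages that rank ("top") vs pages that don't.
top_pages = ["widget technical specifications data sheet dimensions tolerances",
             "widget spec sheet with full technical specifications"]
other_pages = ["buy cheap widgets best widget deals discount widgets",
               "widget widget widget cheap cheap buy now"]

def train(pages):
    counts = Counter(w for p in pages for w in words(p))
    return counts, sum(counts.values())

top_counts, top_total = train(top_pages)
oth_counts, oth_total = train(other_pages)
vocab = set(top_counts) | set(oth_counts)

def log_odds(text):
    """Positive: the page looks more like the 'top' set (naive Bayes, add-one smoothing)."""
    s = 0.0
    for w in words(text):
        p_top = (top_counts[w] + 1) / (top_total + len(vocab))
        p_oth = (oth_counts[w] + 1) / (oth_total + len(vocab))
        s += math.log(p_top / p_oth)
    return s

print(log_odds("widget technical specifications and dimensions"))  # > 0: looks like the top set
print(log_odds("cheap widget deals buy now"))                      # < 0: looks like the rest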

annej




msg:3348738
 2:58 pm on May 24, 2007 (gmt 0)

I agree that the information we are looking at here would not be that useful for any specific SEO. I think this discussion is more general. We are considering whether Google is shifting to a dramatically different method of search. There have already been indications that phrase based methods are at least in play in specific situations.

Tedster, my impression is that in theory supplementary support would not be needed. The algo would just keep refining. Also, a lot of words and phrases would be ignored. This goes back to my earlier concern: would pages in some extremely niche topics be excluded because they wouldn't have the predictive phrases the algo requires?

But then I'm sure this would be a slow process to implement, with a lot of adjustments along the way. There will be collateral damage, and we as webmasters will be frustrated with it at times. It wouldn't be like the original Google process, where inbound links could instantly change the nature of search.

So even though we don't know specifics there are some things I am curious about. For starters I'm wondering what sampling Google would be using to sort out the good and bad phrases with all the refinements mentioned in the patents. Or perhaps it wouldn't be a sampling at all but an analysis of all the information they have gathered from the internet.

mattg3




msg:3349267
 1:13 am on May 25, 2007 (gmt 0)

So even though we don't know specifics there are some things I am curious about. For starters I'm wondering what sampling Google would be using to sort out the good and bad phrases with all the refinements mentioned in the patents. Or perhaps it wouldn't be a sampling at all but an analysis of all the information they have gathered from the internet.

Where would one get bad phrases from, as they would go on into infinity? So I guess they grabbed their favorite server, WP, and maybe some other dictionaries. Whatever isn't in them might be bad phrases.

Besides some specific words like SEO, WAREZ, affiliate and so on, I wouldn't know how else you would get a list of words and phrases that would be unrelated.

I think these filters are definitely at work. How else could a Wikipedia snippet rank on page 1 (having passed the external duplicate content filter because it's diluted with boilerplate content), while a completely valid school teacher's script with 20 times more info is 950ed? The more detail, the better the text, and the more background info (which the simplistic WP doesn't have), the more likely an author is to hit a bad phrase. An extensive professional script is more likely to hit a filter than all the lexicon entries I see now in the German SERPs.

Given the tabloid formula of life, any extensive information would also increase your bounce rate, as users want short answers yesterday, and Google does what users want rather than seeking quality.

That doesn't mean a long article is 100% certain to sink, but your chances definitely increase.
