
"Phrase Based Indexing and Retrieval" - part of the Google picture?

6 patents worth


thegypsy

5:28 am on Feb 9, 2007 (gmt 0)

5+ Year Member



I just noticed the thread on relationships of 'Phrase Based' layering in the -Whatever penalties.

This is interesting. That thread seems to be moving in a different direction so I started this for ONE simple area - Phrase Based Indexing and Retrieval (I call it PaIR to make life easier)

There is MORE here than the thoughts that Ted started towards, as far as -30 type penalties. I have trudged through 5 of the PaIR-related patents from the last year or so and written 3 articles and ONE conspiracy theory on the topic.

One of the more recent inferences was a conspiracy theory tied to the recent GoogleBomb-defusing affair.

In specific, from the patent Phrase identification in an information retrieval system [appft1.uspto.gov]:

"[0152] This approach has the benefit of entirely preventing certain types of manipulations of web pages (a class of documents) in order to skew the results of a search. Search engines that use a ranking algorithm that relies on the number of links that point to a given document in order to rank that document can be "bombed" by artificially creating a large number of pages with a given anchor text which then point to a desired page. As a result, when a search query using the anchor text is entered, the desired page is typically returned, even if in fact this page has little or nothing to do with the anchor text. Importing the related bit vector from a target document URL1 into the phrase A related phrase bit vector for document URL0 eliminates the reliance of the search system on just the relationship of phrase A in URL0 pointing to URL1 as an indicator of significance of URL1 to the anchor text phrase.

[0153] Each phrase in the index 150 is also given a phrase number, based on its frequency of occurrence in the corpus. The more common the phrase, the lower the phrase number it receives in the index. The indexing system 110 then sorts 506 all of the posting lists in the index 150 in declining order according to the number of documents listed in each posting list, so that the most frequently occurring phrases are listed first. The phrase number can then be used to look up a particular phrase. "
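To make the [0153] mechanics concrete, here is a rough Python sketch of that numbering scheme: count documents per phrase, sort posting lists by declining document count, and hand out phrase numbers so the most common phrase gets the lowest number. The function names and data shapes are mine, purely illustrative, not anything from the patent itself.

```python
from collections import defaultdict

def build_phrase_index(docs):
    """Toy sketch of the patent's phrase numbering: count how many
    documents each phrase appears in, sort posting lists in declining
    order of document count, then number phrases so the most frequent
    phrase gets the lowest phrase number."""
    postings = defaultdict(set)          # phrase -> set of doc ids
    for doc_id, phrases in docs.items():
        for p in phrases:
            postings[p].add(doc_id)
    # Most frequently occurring phrases first, as in [0153].
    ordered = sorted(postings.items(), key=lambda kv: -len(kv[1]))
    phrase_number = {p: n for n, (p, _) in enumerate(ordered)}
    return phrase_number, dict(ordered)

docs = {
    1: {"baseball cards", "vintage baseball cards"},
    2: {"baseball cards"},
    3: {"baseball cards", "senate floor"},
}
numbers, postings = build_phrase_index(docs)
# "baseball cards" is in the most documents, so it gets phrase number 0.
```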

Call me a whacked out conspiracy theorist, but I think we could have something here. Is it outright evidence that Google has migrated to a PaIR based model? Of course not. I would surmise that it is simply another layer that has been laid over the existing system, and that the last major infrastructure update (the dreaded BigDaddy) facilitated it. But that's just me.

I am curious as to complementary/contrary theories as mentioned by Ted in the other "Phrase Based Optimization" thread. I simply wanted to keep a clean PaIR discussion.

For those looking to get a background in PaIR methods, links to all 5 patents:

Phrase-based searching in an information retrieval system [appft1.uspto.gov]

Multiple index based information retrieval system [appft1.uspto.gov]

Phrase-based generation of document descriptions [appft1.uspto.gov]

Phrase identification in an information retrieval system [appft1.uspto.gov]

Detecting spam documents in a phrase based information retrieval system [appft1.uspto.gov]

I would post snippets, but it is a TON of research (many groggy hours). I felt posting WHAT "Phrase Based Indexing and Retrieval" is would also dilute the intended direction of the thread, which is to potentially stitch together this and the suspicions of PaIR being at work in the -whatever penalties... more evidence that it is being implemented.

Note: There is a sixth Phrase-based patent:
Phrase identification in an information retrieval system [appft1.uspto.gov]

[edited by: tedster at 6:59 am (utc) on May 14, 2007]

MHes

9:08 am on Feb 10, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Lexur - Google has been using semantics for a long time. Remember the ~widget - widget search that used to show broad match terms? If Google took several pages' content (a cluster) and treated it as one 'document', then ran their semantics process extended to look for 'phrases', they could probably do this at run time. They would then divide and rank the clusters according to which pages bonded according to predictive phrase analysis. I'm also suspicious that they choose the final ranking pages according to user navigation in order to lower the chance of offering an out-of-date page. They would never have to do this with millions of pages, just the top fifty for a search phrase.

[edited by: MHes at 9:10 am (utc) on Feb. 10, 2007]

thegypsy

2:26 pm on Feb 10, 2007 (gmt 0)

5+ Year Member



I would like to reiterate that SPAM DETECTION is merely ONE facet of the system. There is a single patent alone JUST for 'Spam Detection' in a PaIR system.

It is NOT a singularly focused layer. It is, or could be, a standalone IR system. Is it being used as one? Highly unlikely, as we would have seen it come in.

I suspect it has merely been added to existing infrastructure. If you REALLY want to play conspiracy theorist, you could surmise Google used the whole 'GoogleBomb' announcement to 'turn up the dials' on the PaIR influences on the system in an attempt to keep it 'under the radar'.

It is the method of the 'Indexing and retrieval' that is at the core... secondary data sets are then developed from there to deal with issues such as

Weighting/Ranking
Duplicate content
Spam detection and weighting
Links and link profiles
Snippets (page descriptions)
and so on.....

This is where LSA technologies break down

SO ONCE AGAIN.. it is NOT limited to mere Spam detection...

thegypsy

2:40 pm on Feb 10, 2007 (gmt 0)

5+ Year Member




thegypsy, just as a point of information and to clarify, what is the source of that long quote in msg #3247735. Is it from a write-up of yours or another source?
I'm particularly interested in this part:

According to the folks that drafted it, a normal related, topical phrase occurrence (or related phrases) is in the order of 8-20 whereas the typical Spam document would contain between 100-1000 related phrases.

This really helps me understand the approach as an entirety. Whose words and numbers are they?

That's from the articles I wrote on it. A more simplified breakdown...

SNIPPET

As you (undoubtedly) remember, the core concept of the processing is to identify valid (actual/real) phrases in a given document collection (or web pages in our case). The goal is to classify each potential phrase as either "a good phrase or a bad phrase" depending on its usage and frequency, then use those "good" phrases to predict the usage of other "good phrases" in the collection of web pages.

What's a "Good Phrase"?

The classification of a possible phrase as either a good phrase or a bad phrase depends on whether the possible phrase "appears in a minimum number of documents, and appear[s] a minimum number of instances in the document collection". What those numbers are, we don't know. Those are the "dials" only the Search Gods themselves have access to. It is almost like looking at a Phrase Density over the aggregate of documents (the web site). Also, a BAD phrase is not one with dirty words; it is simply a phrase with too low a frequency count to make the "good" list.
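The good/bad split described above is simple enough to sketch in a few lines of Python. The thresholds here are made-up illustrative values, since the real "dials" are unknown:

```python
def classify_phrases(phrase_stats, min_docs=2, min_occurrences=3):
    """Sketch of the "good phrase / bad phrase" split: a phrase is
    good if it appears in a minimum number of documents AND a minimum
    number of times across the collection. min_docs and
    min_occurrences are invented stand-ins for Google's secret dials.
    phrase_stats maps phrase -> (document_count, total_occurrences)."""
    good, bad = set(), set()
    for phrase, (doc_count, occurrences) in phrase_stats.items():
        if doc_count >= min_docs and occurrences >= min_occurrences:
            good.add(phrase)
        else:
            bad.add(phrase)  # too rare -- nothing to do with dirty words
    return good, bad

stats = {
    "vintage baseball cards": (5, 12),   # common enough -> good
    "purple monkey dishwasher": (1, 1),  # too low a frequency -> bad
}
good, bad = classify_phrases(stats)
```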

[edited by: tedster at 3:45 pm (utc) on Feb. 10, 2007]

jimbeetle

4:15 pm on Feb 10, 2007 (gmt 0)

WebmasterWorld Senior Member jimbeetle is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Thanks for the clarification, thegypsy.

TheWhippinpost

9:22 pm on Feb 10, 2007 (gmt 0)

10+ Year Member



Once again, the comprehensive nature of the technology over a simplified model such as LSI is obvious.

I too have expected something along the lines of this phrase theory, though I see it as an extension to an LSI-type algo.

Whereas LSI essentially talks about synonyms of words, this almost lends itself to "synonyms of phrases"

If you were to compare the pages of a product tutorial, a product review, and a typical legit ecommerce product page (we'll assume it's a tech kind of product here), you would most likely see a far higher density of technical language being used on the merchant's page than on either of the others.

What's more, the proximity, i.e. the "distance", between each of those technical words is most likely to be far closer on the merchant's page too (think product specification lists etc.).

Tutorial pages will have a higher incidence of "how" and "why" types of words and phrases.

Reviews will have more qualitative and experiential types of words ('... I found this to be robust and durable and was pleasantly surprised...').

Sales pages similarly have their own (obvious) characteristics.

Mass-generated spammy pages that rely on scraping and mashing up content to avoid dupe filters whilst seeding in the all-important link text (with "buy" words) etc. should, in theory, stand out amongst the above, since the spam will likely draw from a mixture of all the above, in the wrong proportions.

Therefore the associated phrases need not be on the same page, but in the cluster of pages and the overall density and frequency valued over the whole cluster.

Most definitely.
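The "wrong proportions" point above, combined with the 8-20 vs. 100-1000 related-phrase numbers quoted earlier in the thread, suggests a very simple spam flag. A rough Python sketch, where the relatedness map and the spam threshold are my own illustrative stand-ins for the patent's related-phrase bit vectors and dials:

```python
def related_phrase_count(doc_phrases, related):
    """Count co-occurring related-phrase pairs in one document.
    `related` maps a phrase to the set of phrases it predicts
    (a stand-in for the patent's related-phrase bit vectors)."""
    count = 0
    for p in doc_phrases:
        count += len(related.get(p, set()) & doc_phrases)
    return count

def looks_spammy(doc_phrases, related, spam_floor=100):
    """Per the numbers quoted earlier (8-20 related phrases reads as
    a normal topical document, 100-1000 reads as spam), flag anything
    above a threshold. spam_floor is illustrative, not a known value."""
    return related_phrase_count(doc_phrases, related) >= spam_floor
```

A scraped mash-up that stuffs in hundreds of mutually related phrases trips the floor; an ordinary topical page stays well under it.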

jimbeetle

11:12 pm on Feb 10, 2007 (gmt 0)

WebmasterWorld Senior Member jimbeetle is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Whereas LSI essentially talks about synonyms of words, this almost lends itself to "synonyms of phrases"

I think "synonyms of phrases" gets a bit too far away from the idea. We might be able to get away with saying "related phrases" (because in a way they are), but even that is not quite illustrative of the idea that the presence of Phrase A predicts the probability of the presence of Phrase B.
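That "Phrase A predicts Phrase B" notion can be sketched as a co-occurrence lift calculation: B is a predicted phrase of A if B shows up in documents containing A notably more often than in the corpus at large. This is my own toy rendering, with an arbitrary lift threshold, not the patent's actual math:

```python
from collections import Counter

def prediction_scores(docs, min_lift=2.0):
    """Toy sketch of phrase prediction: for each phrase A, keep
    phrase B as "related" if B's frequency inside A-documents is at
    least min_lift times its frequency in the whole corpus.
    docs is a list of sets of phrases; min_lift is illustrative."""
    n = len(docs)
    doc_freq = Counter(p for phrases in docs for p in phrases)
    related = {}
    for a in doc_freq:
        with_a = [d for d in docs if a in d]
        co = Counter(p for d in with_a for p in d if p != a)
        related[a] = {
            b for b, c in co.items()
            if (c / len(with_a)) / (doc_freq[b] / n) >= min_lift
        }
    return related
```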

Therefore the associated phrases need not be on the same page, but in the cluster of pages and the overall density and frequency valued over the whole cluster.

This and other points MHes raised are very interesting. Using all the usual possible/maybe/it-feels-like caveats, this could explain folks' comments in other threads about taking hits on pages in one directory. Assuming pages in a single directory are somewhat related and maybe interlinked, that would -- or could -- fit MHes's definition of a "cluster of pages".

jk3210

11:32 pm on Feb 10, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Assuming pages in a single directory are somewhat related and maybe interlinked, that would -- or can -- fit MHes's definition of a "cluster of pages".

For a 3-word "cityname widgets" search, 43 of my pages are grouped at the bottom of the last serps page, mostly ones from the same directory and all interlinked. The pages in the group of 43 that AREN'T in that one directory are up the directory tree and link to those pages.

Yet, the main page that all those individual pages link to is still #1.

annej

11:36 pm on Feb 10, 2007 (gmt 0)

WebmasterWorld Senior Member annej is a WebmasterWorld Top Contributor of All Time 10+ Year Member



I find the idea that Google may be looking at related phrases in a cluster of pages has interesting possibilities.

Assuming pages in a single directory are somewhat related and maybe interlinked, that would -- or can -- fit MHes's definition of a "cluster of pages".

In the case where a good inbound link appeared to bring back one of my contents pages within a couple of days I also got back 3 pages that were linked from that contents page. But one page remains missing.

I'm thinking the missing page had too many other "good" phrase matches so the return of the Contents page wasn't enough.

In another section of my site the whole topic is gone, contents page and article pages. Two other sections of my site just have one or two pages missing and there seems to be no association with the contents pages.

TheWhippinpost

2:02 am on Feb 11, 2007 (gmt 0)

10+ Year Member



I think "synonyms of phrases" gets a bit too far away from the idea.

Hence why I quoted it, Jim... I used "synonym" because that's how people generally began to describe the words LSI outputted, when in fact it's more about discovering unique words that are most likely to be found together (a very simple summary, obviously); these aren't synonyms you necessarily find in the dictionary.

So you end up coming out of that process with a few words that one could expect to see when flirting around another particular word... a related word(s), ya might say.

... or a predictive word; or even an expected word!

I've played a lot with the tilde (~) operator over time and know the algo can "relate" a brand-name to a manufacturer-name. It has also learnt to relate acronyms to it, as well as others.

To expand that to encompass a series of words, instead of just one, would be just a maths and computational exercise, I would'a thought... and we all know about BD.

Assuming pages in a single directory are somewhat related and maybe interlinked, that would -- or can -- fit MHes's definition of a "cluster of pages".

I'd go further than that: The calculation will be made across a cluster of documents it has already judged to fall within the area of interest - which might even be the number of documents it says it found on the SERP.

This won't be just your site, or directory, though clearly if you have a good focussed directory, it would/could figure more dominantly than the opposite case.

[edited by: TheWhippinpost at 2:03 am (utc) on Feb. 11, 2007]

thegypsy

6:49 am on Feb 11, 2007 (gmt 0)

5+ Year Member



Whippinpost – strangely, it is NOT an addition to any LSA/I technologies. I dare say we may have missed the boat and it never left AdSense/AdWords. This is a standalone method that is far more comprehensive than LSI... Jim seems to be getting the idea...

Though following the trail of LSA/I last year did bring me to this.

To keep things moving along, a snippet from one of the articles – it delves into term extensions: connecting words that create phrasings, and the PaIR basic model for identification.


Phrase Extensions and identification
Phrase extensions are merely additional words on the core term(s). If we had the core term ‘Baseball Cards’ we could ‘extend’ it with ‘Vintage Baseball Cards’, ‘Buy Vintage Baseball Cards’ and finally ‘Buy Vintage Baseball Cards Online’ – you get the idea.
To identify a potential phrase, the algo looks at a phrase such as "Hillary Rodham Clinton Bill on the Senate Floor", from which it would take "Hillary Rodham Clinton Bill on," "Hillary Rodham Clinton Bill," and "Hillary Rodham Clinton". Only the last one is kept. It would also identify "Bill on the Senate Floor" and the inferences down to 'bill'.
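The windowing step above is easy to sketch: generate every span of up to N words as a candidate, then keep only candidates already on the "good phrase" list. The good-phrase set and the max window length here are my own illustrative stand-ins:

```python
def candidate_phrases(words, max_len=5):
    """Sliding-window candidate generation: every span of up to
    max_len consecutive words is a potential phrase."""
    spans = []
    for i in range(len(words)):
        for j in range(i + 1, min(i + 1 + max_len, len(words) + 1)):
            spans.append(" ".join(words[i:j]))
    return spans

def identify_phrases(text, good_phrases):
    """Keep only candidates that made the known good-phrase list
    (a stand-in for the frequency tests in the patent)."""
    words = text.lower().split()
    return [p for p in candidate_phrases(words) if p in good_phrases]

good = {"hillary rodham clinton", "senate floor"}
found = identify_phrases("Hillary Rodham Clinton Bill on the Senate Floor", good)
```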

And on scoring/ranking:

In the end it is these related phrase/theme scores that are used in the ranking of documents for a given search query. Documents containing more related phrases and secondary related phrases for the query phrases are ranked highest. The semantically topical, relevant page gets the highest ranking.
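As a rough Python sketch of that ranking idea, score each document by how many query-related phrases it contains and sort descending. The related-phrase map is an assumed input, not something the patent spells out in this form:

```python
def rank_documents(query_phrase, docs, related):
    """Score each document by the number of phrases related to the
    query phrase (plus the query phrase itself) that it contains,
    then rank descending. `related` is an assumed phrase -> related
    phrases map; `docs` maps doc_id -> set of phrases."""
    wanted = related.get(query_phrase, set()) | {query_phrase}
    scored = [(len(wanted & phrases), doc_id)
              for doc_id, phrases in docs.items()]
    return [doc_id for score, doc_id in sorted(scored, reverse=True)]
```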

How about backlinks?

Anchor phrase scoring also counts related query phrases appearing in the text links to other documents. There are two scores here: a 'body' score and an 'anchor' score. Greater scoring is given if a good phrase appears in the text link as well as in the body of the referenced document. Additionally, the anchor text pointing TO your site is also analyzed and scored under the same methods.
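A minimal sketch of combining the two scores, where the 2x anchor weight is purely my own illustrative assumption (the patents don't publish real weights):

```python
def combined_score(doc, query_related):
    """Sketch of body + anchor scoring: count query-related phrases
    in the page body and in inbound anchor text, weighting anchor
    hits higher. The 2x multiplier is an invented illustrative dial.
    `doc` has "body" and "anchors" phrase sets."""
    body_hits = len(query_related & doc["body"])
    anchor_hits = len(query_related & doc["anchors"])
    return body_hits + 2 * anchor_hits
```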

Once again, the PaIR model is FAR more comprehensive in its abilities than the LSI model. Has LSA/I been used in the organic SERPs since 2003 (when G purchased Applied Semantics)? Maybe. If this is part of the 'new' world, it is one hell of an upgrade...
…and deeper we go….