
Google SEO News and Discussion Forum

"Phrase Based Indexing and Retrieval" - part of the Google picture?
6 patents worth
thegypsy




msg:3247209
 5:28 am on Feb 9, 2007 (gmt 0)

I just noticed the thread on the relationship of 'Phrase Based' layering to the -whatever penalties.

This is interesting. That thread seems to be moving in a different direction, so I started this one for ONE simple area - Phrase Based Indexing and Retrieval (I call it PaIR to make life easier).

There is MORE to this than the thoughts Ted started towards, as far as -30 type penalties. I have trudged through 5 of the PaIR-related patents from the last year or so and written 3 articles and ONE conspiracy theory on the topic.

One of the more recent inferences was a conspiracy theory around the recent GoogleBomb-defusing affair.

Specifically, from the patent Phrase identification in an information retrieval system [appft1.uspto.gov]:

"[0152] This approach has the benefit of entirely preventing certain types of manipulations of web pages (a class of documents) in order to skew the results of a search. Search engines that use a ranking algorithm that relies on the number of links that point to a given document in order to rank that document can be "bombed" by artificially creating a large number of pages with a given anchor text which then point to a desired page. As a result, when a search query using the anchor text is entered, the desired page is typically returned, even if in fact this page has little or nothing to do with the anchor text. Importing the related bit vector from a target document URL1 into the phrase A related phrase bit vector for document URL0 eliminates the reliance of the search system on just the relationship of phrase A in URL0 pointing to URL1 as an indicator of significance or URL1 to the anchor text phrase.

[0153] Each phrase in the index 150 is also given a phrase number, based on its frequency of occurrence in the corpus. The more common the phrase, the lower phrase number it receives in the index. The indexing system 110 then sorts 506 all of the posting lists in the index 150 in declining order according to the number of documents listed in each posting list, so that the most frequently occurring phrases are listed first. The phrase number can then be used to look up a particular phrase."
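
Taking [0153] at face value, the ordering step it describes might look roughly like the sketch below - the corpus, the phrases and the resulting numbers are all invented for illustration, not taken from the patent:

```python
from collections import defaultdict

# Toy inverted index: phrase -> set of document IDs containing it.
# Documents and phrases are made up purely for illustration.
corpus = {
    "doc1": ["australian football", "football league"],
    "doc2": ["australian football"],
    "doc3": ["football league", "ticket prices"],
}
postings = defaultdict(set)
for doc_id, phrases in corpus.items():
    for phrase in phrases:
        postings[phrase].add(doc_id)

# Sort posting lists in declining order of document count, so the most
# frequently occurring phrases come first, then hand out phrase numbers in
# that order (more common phrase -> lower phrase number), as [0153] describes.
ordered = sorted(postings.items(), key=lambda kv: len(kv[1]), reverse=True)
phrase_number = {phrase: n for n, (phrase, _) in enumerate(ordered)}

print(phrase_number)
# e.g. {'australian football': 0, 'football league': 1, 'ticket prices': 2}
```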

Call me a whacked-out conspiracy theorist, but I think we could have something here. Is it outright evidence that Google has migrated to a PaIR-based model? Of course not. I would surmise that it is simply another layer that has been laid over the existing system, and that the last major infrastructure update (the dreaded BigDaddy) facilitated it. But that's just me.

I am curious about complementary/contrary theories as mentioned by Ted in the other "Phrase Based Optimization" thread. I simply wanted to keep a clean PaIR discussion.

For those looking to get a background in PaIR methods, links to all 5 patents:

Phrase-based searching in an information retrieval system [appft1.uspto.gov]

Multiple index based information retrieval system [appft1.uspto.gov]

Phrase-based generation of document descriptions [appft1.uspto.gov]

Phrase identification in an information retrieval system [appft1.uspto.gov]

Detecting spam documents in a phrase based information retrieval system [appft1.uspto.gov]

I would post snippets, but it is a TON of research (many groggy hours). I felt that posting WHAT "Phrase Based Indexing and Retrieval" is would also dilute the intended direction of the thread, which is to potentially stitch together this and the suspicions of PaIR being at work in the -whatever penalties... more evidence that it is being implemented.

Note: There is a sixth Phrase-based patent:
Phrase identification in an information retrieval system [appft1.uspto.gov]

[edited by: tedster at 6:59 am (utc) on May 14, 2007]

 

thegypsy




msg:3249534
 10:55 pm on Feb 11, 2007 (gmt 0)

OK, so why did you say in your intro that you wanted... "suspicions of PaIR being at work in the -whatever penalties... more evidence that it is being implemented." Or am I reading you wrong? What do you want in this thread? I think a discussion without reference to 950 is a bit pointless.

Actually my point was to tie in the PaIR discussion, or more specifically to further it. In case you hadn't noticed, -30, -950, -whatever threads don't go far and, as I mentioned, are easily explained statistically with some 24 billion pages. Just not enough there to warrant serious consideration for the professional SEO...

What I have attempted to do is bring new methodologies/technologies to light so others can begin to embrace them. There is FAR more evidence of PaIR methodologies being in play than there is for -whatever penalties (or a definitive description of the sandbox even, I dare say).

What one should be doing is trying to understand the implications of such technologies on the various facets of SEO (link profiles, on-page work, themes, internal linking prominence and the like).
I am in the business of SEO. While I have a keen interest in search engineering, dissecting the Google algo for the purpose of whining about how it SHOULD be done is not part of my job description.

I prefer to turn analysis into profits, not hyperbole.

MHes




msg:3249831
 7:35 am on Feb 12, 2007 (gmt 0)

>There is FAR more evidence of PaIR methodologies being in play than there is for -whatever penalties

OK, please can you expand on that? I know it is difficult without specific examples, but what is the evidence, and what are the SEO implications, with reference to:

1) link profiles
2) on page work
3) themes
4) internal linking prominence

annej




msg:3250859
 4:15 am on Feb 13, 2007 (gmt 0)

The "Detecting spam documents in a phrase based information retrieval system" patent application talks about "comparing the actual number of related phrases in a document with the expected number of related phrases"

It just occurred to me that I have been assuming it has to do with density, but now I'm wondering if it is talking about an absolute number of occurrences of the phrases. Since we don't know what the phrases are, would it help to break articles into two pages in hopes of not going over the limit of related phrases?

I don't know if that even makes sense. I'm grasping at straws here.

Marcia




msg:3250862
 4:36 am on Feb 13, 2007 (gmt 0)

The "Detecting spam documents in a phrase based information retrieval system" patent application talks about "comparing the actual number of related phrases in a document with the expected number of related phrases"

The indexing and comparing of phrases, or "phrase based indexing" is an implementation, with the particular applications and methodologies being described in the patents.

But phrase-based indexing is just that - an implementation, and is nothing more than an implementation that's a part of broader technologies.

IOW, "phrase based indexing" as such isn't a technology in itself, it's just the way the technologies are being implemented in these inventions.

[edited by: Marcia at 4:40 am (utc) on Feb. 13, 2007]

thegypsy




msg:3251201
 1:01 pm on Feb 13, 2007 (gmt 0)

Marcia - I fail to see the point or its relevance.

jk3210




msg:3254687
 5:45 pm on Feb 16, 2007 (gmt 0)

Okay, so for Google to accept these current results (which have valuable pages pinned to the bottom of the serps), there must be an overriding benefit for them to allow it to continue. What could that benefit be?

As I read the Phrase-based indexing patents, it seems to me that (in very basic terms) one of the things Google is trying to do is to ascertain which pages "discuss a subject" and which pages merely "target a phrase." IOW, which pages are attempts at producing/discussing information and concepts, and which pages are merely a collection of words whose only purpose is to act as a platform to hang targeted key-phrases on. All else being equal, that would be a valuable capability for a search engine to have, right?

But, what is the *overriding* benefit motivating Google to accept the considerable collateral damage to potentially valuable pages? Well, wouldn't the ability to defeat *content generators* be worth the cost, considering that Google has to wade through billions of auto-gen'd pages of crap every day?

The system is further adapted to identify phrases that are related to each other, based on a phrase's ability to predict the presence of other phrases in a document. More specifically, a prediction measure is used that relates the actual co-occurrence rate of two phrases to an expected co-occurrence rate of the two phrases. Information gain, as the ratio of actual co-occurrence rate to expected co-occurrence rate, is one such prediction measure.

Why would Google want to *predict* the presence of related phrases? Why would they care? A phrase is either present or not present, but why try to *predict* its presence? What's the point?

Well, don't scraper pages have the targeted phrase repeated over and over, an inordinate number of times? And wouldn't that repetition, WITHOUT the presence of related phrases that were *predicted* to be present, skew the information-gain ratio for a scraper page?... maybe enough to allow them to be identified and zapped?

In greatly simplified terms, if the phrase based indexing system determined that for the phrase [foo city hotels] --which had been deemed a "good phrase" --the phrase [foo city convention center] SHOULD also be present, and zapped all pages that didn't contain [foo city convention center], would a 99.9% accuracy/success rate against scrapers, content generators and other low value sludge be worth the negative effect of wrongly sending .1% of the valuable pages to the 950 range?

Just a thought.
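
For what it's worth, the prediction measure in the quoted paragraph - information gain as the ratio of actual to expected co-occurrence rate - can be sketched in a few lines. The document counts and the threshold below are invented; the patent gives no values:

```python
def information_gain(docs_with_a, docs_with_b, docs_with_both, total_docs):
    """Ratio of the actual co-occurrence rate of phrases A and B to the
    rate expected if the two phrases occurred independently."""
    actual = docs_with_both / total_docs
    expected = (docs_with_a / total_docs) * (docs_with_b / total_docs)
    return actual / expected if expected else 0.0

# Made-up counts for "foo city hotels" (A) and "foo city convention center" (B)
gain = information_gain(
    docs_with_a=2000,
    docs_with_b=500,
    docs_with_both=300,
    total_docs=1_000_000,
)
print(gain)        # 300.0 with these invented counts
print(gain > 100)  # True -> treat the phrases as "related" (threshold is a guess)
```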

Marcia




msg:3254801
 7:29 pm on Feb 16, 2007 (gmt 0)

thegypsy:
Marcia - I fail to see the point or its relevance.

The point is just what was said. If it's in use it's a part of the bigger picture, including use of some of the factors being among the "over 100 factors" that have already been in use. This is a sophisticated and innovative process to be sure, but it's founded on technologies that are its predecessors, some common in IR for a number of years. That can be clearly seen by the language, terminology and concepts being described.

But now let me answer your question with a question:

Have you discerned and recognized the basic, classic IR technologies that underlie this group of apps and that preceded it?

thegypsy




msg:3255414
 1:46 pm on Feb 17, 2007 (gmt 0)

Why would Google want to *predict* the presence of related phrases? Why would they care? A phrase is either present or not present, but why try to *predict* its presence? What's the point?

It's not trying to use ESP... it's more a case of:

'Statistically, we'd expect to see X number of instances of good phrases.'

It's more the algo having an 'expectation' and 'threshold' for given related core terms. So the scraper page would have an unusually high occurrence and co-occurrence rate for the core and related phrases.

Furthermore, pages aren't ZAPPED - pages obtain added weighting for each of the 'expected' page scores they satisfy. It is not about what 'SHOULD' be present. Once again, it is not establishing the worth of a page, simply a statistical commonality among ALL pages in that themed group within the index.

Also, two things:

1. Existing ranking methods are still in play (in reality or theoretically).
2. Links still play a large role in the ultimate ranking in the PaIR method; this has not been accounted for in your thinking.


The point is just what was said. If it's in use it's a part of the bigger picture, including use of some of the factors being among the "over 100 factors" that have already been in use. This is a sophisticated and innovative process to be sure, but it's founded on technologies that are its predecessors, some common in IR for a number of years. That can be clearly seen by the language, terminology and concepts being described.
But now let me answer your question with a question:
Have you discerned and recognized the basic, classic IR technologies that underlie this group of apps and that preceded it?

As I have mentioned MORE than a few times, it is simply (if at all) a layering onto the existing infrastructure and algorithmic operations. I would personally be turning the dials slowly to make the PaIR methods more prominent over the next year, but who knows on that one...

I know my SEO history back to 1995 and have been in the web biz game since 1998 - my indexing and retrieval studies go back to the early '90s - so I certainly have a good understanding of the road traveled up to this point.
This is actually the tail end of the PaIR discussions for me... I started digging into this and writing about it last fall... My fishin' head is spinning... he he..

I am always studying/researching something; 'Personalized Search' technologies recently captured my eye... (Wolfie did 'Local Search' to death recently)...

So yes Marcia, my name is David…… and I.. am an ‘Aldo-holic’

steve40




msg:3255435
 2:24 pm on Feb 17, 2007 (gmt 0)

thegypsy

Well, I would like to thank you for your valuable contribution and your willingness to share your wisdom.

Without yours and others' contributions I could read many of the patents G has till I am blue in the face and still be none the wiser, so I thank you for taking the time to put them in a simpler form.

I think, from this and your expectation of increased use of this LSI / Phrase Based Indexing and Retrieval plus a couple of other algo changes, that we are just seeing the start of implementation. The current top 20 SERP results could well end up very different, with more than the 0.1% collateral damage others have predicted - my own view is possibly up to 5%.

I think we may be looking at the next generation of SE algos; where G has gone, the others will surely follow.

Just one final comment: as this becomes more refined it could mean that page content becomes more important than links over the next 12 months, which in my opinion would be a good thing.

steve

SincerelySandy




msg:3255454
 3:00 pm on Feb 17, 2007 (gmt 0)

thegypsy
The classification of possible phrases as either a good phrase or a bad phrase is when the possible phrase "appears in a minimum number of documents, and appears a minimum number of instances in the document collection".

I thought that a "good phrase" was classified as a phrase that could be used to predict the occurrence of other phrases. You seem to be saying that a "good phrase" is determined by the number of times that phrase appears in various places. Am I understanding you correctly?
a BAD phrase is not one with dirty words, it is simply a phrase with too low a frequency count to make the "good" list.

I thought that a "bad phrase" was simply one that could not be used to predict the occurrence of other phrases?

annej




msg:3255464
 3:23 pm on Feb 17, 2007 (gmt 0)

Links still play a large role in the ultimate ranking in the PaIR method

Actually this has been discussed by several of us in earlier threads. I think this has always been true with many of the various filters over the years. They have to be strong and somewhat related links though.

Sandy, the way I understand it, a good phrase is one that would be predictive, but the actual penalty, filter or whatever is based on the density of these phrases. So the phrases are already predetermined; then, at the time the page is spidered, the density of related phrases is calculated. At least that is how I read it.

thegypsy




msg:3255482
 3:50 pm on Feb 17, 2007 (gmt 0)

Anna... U R getting there fast.... nice work!

1. Links - precisely as U have surmised. It would add weighting to the existing link profile based on what PaIR factors have been satisfied. So relevance of the text AND the destination and host page come into play.

As you may imagine, the outbound links and inlinks (internal links) also get treatment. For outbound links it looks at the anchor text and compares it against the 'good list' and scores it accordingly. It also checks the document (web page) of the target site against the good list, and further accreditation is given. Partial scoring also comes into play if, for example, the target document has 'Australian' but not 'Australian Football'. While not a complete miss, it wouldn't get FULL marks.

Anchor phrase scoring is also counted for the related query phrase in the text links to other documents. There are 2 scores here: the 'body' score and the 'anchor' score. Greater scoring is obviously given if a good phrase is in the text link as well as in the body of the referenced document. Additionally, the anchor text TO your site is also analyzed and scored accordingly under the same methods.

2. ESP - once again let's not get caught up in the 'semantics' of the 'predictive' aspect. Like I said, think of it more as 'expectation' than predictive. Not to over simplify, but a 'phrase density' outlook captures the essence better.

The proposed identification process begins as follows:

Collect possible and good phrases, along with frequency and co-occurrence statistics of the phrases.
Classify possible phrases as either good or bad phrases based on frequency statistics.
Prune the good phrase list based on a predictive measure derived from the co-occurrence statistics.

So it basically has 2 filters to further refine the list of 'good phrases' to identify the strongest elements of the site, or what could be loosely described as a theme (see the sketch below).
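
Here is one loose reading of those three steps in code. Every threshold and helper name is my own guess at how such a pipeline might hang together, not anything the patents specify:

```python
from collections import Counter

# Invented thresholds, for illustration only.
MIN_DOCS, MIN_INSTANCES, MIN_GAIN = 10, 20, 100.0

def identify_good_phrases(docs):
    """docs: dict of doc_id -> list of candidate phrases found in that doc."""
    # Step 1: collect frequency and co-occurrence statistics.
    doc_freq, instance_freq, cooccur = Counter(), Counter(), Counter()
    for phrases in docs.values():
        instance_freq.update(phrases)
        unique = set(phrases)
        doc_freq.update(unique)
        for a in unique:
            for b in unique:
                if a != b:
                    cooccur[(a, b)] += 1

    # Step 2: classify possible phrases as good or bad on raw frequency.
    good = {p for p in doc_freq
            if doc_freq[p] >= MIN_DOCS and instance_freq[p] >= MIN_INSTANCES}

    # Step 3: prune - keep a good phrase only if it predicts at least one
    # other good phrase (co-occurrence based information gain over a threshold).
    total = len(docs)
    def gain(a, b):
        expected = (doc_freq[a] / total) * (doc_freq[b] / total)
        return (cooccur[(a, b)] / total) / expected if expected else 0.0

    return {a for a in good
            if any(gain(a, b) >= MIN_GAIN for b in good if b != a)}
```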

[edited by: tedster at 6:59 pm (utc) on Feb. 17, 2007]

SincerelySandy




msg:3255491
 3:59 pm on Feb 17, 2007 (gmt 0)

a good phrase is one that would be predictive but the actual penalty, filter or whatever is based on the density of these phrases

My impression is that the filters associated with phrase-based indexing are not looking at the density of "good phrases" (that's just keyword density, and they've already got filters that look at KW density). It seems to me that the filters are looking for occurrences of the other words and phrases that would be associated with the "good phrases", and if not enough of these "other phrases" are found then the filters are applied. So if you have a bunch of different "good phrases" on a page, but the page does not contain some of the other words and phrases that the "good phrase" is indicative of, a filter catches it. Similarly, if you only have one "good phrase" that is used too many times on a page, and that page does not contain some of the other phrases that the too-frequently-repeated "good phrase" is indicative of, a filter catches it.
Just a guess. Are we saying the same thing in a different way, AnneJ?
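
If that reading is right, the check could be as simple as the sketch below; the related-phrase table and the minimum of two accompanying phrases are pure guesses on my part:

```python
# Hypothetical related-phrase table for one "good phrase" (invented data).
RELATED = {
    "foo city hotels": {"foo city convention center", "foo city airport",
                        "hotel reviews", "check-in"},
}

def looks_targeted(page_text, good_phrase, min_related=2):
    """Flag a page that uses a good phrase but carries too few of the
    phrases that usually accompany it - one reading of the filter above."""
    text = page_text.lower()
    phrase = good_phrase.lower()
    if phrase not in text:
        return False
    present = sum(1 for rp in RELATED.get(phrase, ()) if rp in text)
    return present < min_related

page = "Foo city hotels, foo city hotels, book foo city hotels now!"
print(looks_targeted(page, "foo city hotels"))  # True - no accompanying phrases
```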

jk3210




msg:3255535
 4:56 pm on Feb 17, 2007 (gmt 0)

Furthermore, pages aren't ZAPPED - pages obtain added weighting for each of the 'expected' page scores they satisfy

Yes, I understand that. "ZAPPED" [Re: content generators] was a succinct way of stating my opinion as to what their ultimate goal would be, which might explain WHY they would include this analysis as part of the algo --if in fact it is or ever will be included.

You never address the question of "WHY." What would be the purpose? Would they be doing this analysis as a science project, or would they have a purpose for it?...and if so, what is it?

jimbeetle




msg:3255578
 5:51 pm on Feb 17, 2007 (gmt 0)

thegypsy, while I appreciate your contributions to my understanding, I have to once again ask you to source your quotes. It really helps to know whether it's somebody else or you talking.

Oliver Henniges




msg:3255627
 6:37 pm on Feb 17, 2007 (gmt 0)

The universe of "all possible phrases" is gigantic, even for three-word phrases and even for one single language. To me the key issue seems to be the mechanisms by means of which Google narrows down this mass.

If I understood the patent correctly, this is all done "on the fly", whilst crawling, evaluating and indexing a certain bunch of a couple million pages on the web. At least the spam detection patent is NOT applied to the whole index in one big loop. How is this subset of a few million pages preselected? By chance and link structure in the normal crawl?

It is impossible to store the full co-occurrence matrix as an intermediate step, unless you concentrate on a core of a few thousand of the most spammy keywords and phrases.

Again: if we want to proceed towards a closer understanding (and perhaps simulation) of the mechanisms at work, it is essential to narrow the problem down to a level computable on a normal PC.

If I'm completely wrong with this, please enlighten me about the passages I overlooked.

jimbeetle




msg:3255649
 7:17 pm on Feb 17, 2007 (gmt 0)

unless you concentrate on a core of a few thousand most-spammy keywords and phrases

Hmmm, the infamous Florida update, when big bunches of real estate and travel sites got hit?

Those notably spammy sectors would have been a good place to start to test the waters.

annej




msg:3255678
 7:48 pm on Feb 17, 2007 (gmt 0)

Anna... U R getting there fast.... nice work!

Actually I can't take credit, it was brought up by other folks in the recent discussions.

filters are looking for occurrences of the other words and phrases that would be associated with the "good phrases", and if not enough of these "other phrases" are found then the filters are applied.

I'm looking at the claims section of patent application 0060294155 (detecting spam in phrase based)

Here it is in my words (which may be wrong but I'm trying to get it right)

It is worked out what phrases tend to be seen together in a naturally written document (road building plus laying asphalt). At least I am guessing this is how they decide on the number of related phrases that are expected to be present. This is compared with the balance of these phrases in a typical spam document.

So when a page is spidered, the accepted phrase ratio is already set. So if the page has one phrase more often than is acceptable, it is filtered out or penalized. This is why I keep saying it's a thin line between ranking well and plunging in the serps.

So if you have a bunch of different "good phrases" on a page, but the page does not contain some of the other words and phrases that the "good phrase" is indicative of, a filter catches it.

Good point. I've been concentrating on the other aspect because I'm working with articles. With them there is usually a good variety of related phrases, so I've been assuming it would help to get rid of any excessive phrases. So I've had a bit of tunnel vision there and need to think about the other aspects.

unless you concentrate on a core of a few thousand most-spammy keywords and phrases

That's why I've been noticing the ads that Google shows on their result pages when I search various phrases. Also I've noticed what words the scraper sites use. This gives me an idea of what words or phrases are causing the problem on a page that has dropped severely in the serps. I've done this in relation to my general topic but suspect it would be useful with other topics as well.

tedster




msg:3255693
 8:08 pm on Feb 17, 2007 (gmt 0)

So if the page has one phrase more often than is acceptable it is filtered out or penalized.

However, the frequency for what is "acceptable" is still something that "significantly exceeds the expected number".

[0217] A spam document may be indicated if the actual number N of related phrases significantly exceeds the expected number E, for some minimum number of good phrases. In one implementation, N significantly exceeds E where it is at least some multiple number of standard deviations greater than E, for example, more than five standard deviations. In another implementation, N significantly exceeds E where it is greater by some constant multiple, for example N>2E. Other comparison measures can also be used as a basis for determining that the actual number N of related phrases significantly exceeds the expected number E. In another embodiment, N is simply compared with a predetermined threshold value, such as 100 (which is deemed to be maximum expected number of related phrases).

[0218] Using any of the foregoing tests, it is determined whether this condition is met for some minimum number of good phrases. The minimum may be a single phrase, or perhaps three good phrases. If there are a minimum number of good phrases which have an excessive number of related phrases present in the document, then the document is deemed to be a spam document.
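
Just to make the arithmetic concrete, here is how I would sketch [0217]/[0218]. The five standard deviations, the 2E multiple, the cap of 100 and the "perhaps three good phrases" minimum all come from the quote; note the patent offers them as alternative implementations, while this sketch simply ORs the tests together, and every number in the example is made up:

```python
def significantly_exceeds(n, expected, std_dev):
    """[0217]: N 'significantly exceeds' E if it is more than five standard
    deviations above E, or greater than 2E, or above a flat cap of 100."""
    return (n > expected + 5 * std_dev) or (n > 2 * expected) or (n > 100)

def is_spam_document(per_phrase_counts, min_good_phrases=3):
    """[0218], loosely: deem the document spam if at least `min_good_phrases`
    good phrases each show an excessive number of related phrases.
    `per_phrase_counts` maps a good phrase to a tuple of
    (actual_related, expected_related, std_dev) - all illustrative."""
    excessive = sum(
        1 for actual, expected, std in per_phrase_counts.values()
        if significantly_exceeds(actual, expected, std)
    )
    return excessive >= min_good_phrases

# Invented figures for a hypothetical page.
counts = {
    "foo city hotels": (140, 40, 6.0),
    "cheap flights":   (95, 30, 5.0),
    "car hire":        (88, 35, 4.5),
}
print(is_spam_document(counts))  # True with these made-up figures
```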


Oliver Henniges




msg:3255711
 8:49 pm on Feb 17, 2007 (gmt 0)

> ...because I'm working with articles. With them there is usually a good variety of related phrases...

My sphere is the other end: product pages with relatively thin content. Hardly any phrase related to the term "widget" is likely to occur on the page at all.

So if you have a bunch of different "good phrases" on a page, but the page does not contain some of the other words and phrases that the "good phrase" is indicative of, a filter catches it.

Which definitely is not the case on most of my product pages. Nevertheless my pages still do fine in the SERPs (though I admit it's a niche market). So, all in all, this is evidence that at least the spam detection patent is quite likely applied to only a strongly selected set of phrases. Which is only natural: it is limited to spammy phrases.

But for me the question remains open as to what extent phrase-based IR can be applied to such an enormous number of words and their possible combinations in an environment like the internet.

thegypsy




msg:3255721
 9:17 pm on Feb 17, 2007 (gmt 0)

If it's articles or product pages it doesn't really matter as U are competing with LIKE pages...

The whole PaIR concept changes VERY LITTLE in what we already know and how we operate... it simply may 'explain' a few things better...

Your blue widget page competes against MY blue widget page.
If I put up a 'target' page for my widgets with info and resources, I would rank higher than U.

That is not any different.

If I have more quality backlinks than U,
I would rank higher (high quality meaning the site/authority, page relevance and text of the BL in question).

That is relatively unchanged (following the current theory)

Quality content and quality backlinks simply take an enhanced position. The core ranking fundamentals are not unlike what we have been doing all along...

While it is a new method, it still is layered on and inspired by current algorithmic attempts at providing 'relevant' results at the Big G....

I wouldn't get worked up over the Spam Detection aspects, as in such a system I would see LESS collateral damage. Getting snagged in a PaIR spam index is unlikely.

Also, MOST penalties are a matter of satisfying more than ONE web spam SNAFU... i.e. KW stuffing + aggregate dupes + link spam... so consider the PaIR spam detection as merely a layer, not a stand-alone 'catch-all' solution....

[edited by: thegypsy at 9:19 pm (utc) on Feb. 17, 2007]

jk3210




msg:3255722
 9:18 pm on Feb 17, 2007 (gmt 0)

@thegypsy
Never mind my previous post, I think I've got it now. :)

tedster




msg:3255756
 11:33 pm on Feb 17, 2007 (gmt 0)

One factor that can help minimize false positives, as I understand it at least, is the fact that the "expected number E" will be measured relative to each target phrase, and across quite a wide sample of documents -- so it won't be nearly the same number in the case of diverse search phrases.

[0220] For each of these most significant related phrases, the number of related phrases present in the document is determined, again from their related phrase bit vectors. If the actual number of related phrases significantly exceeds the expected number (using any of the above described tests), then the document is deemed a spam document with respect to that most significant phrase...

I would imagine that "thin" pages would scoot right by this test, whereas article pages written with certain target searches in mind might trip the spam test. Do others see it this way?

While it is a new method, it still is layered on and inspired by current algorithmic attempts at providing 'relevant' results at the Big G....

That's how I see it, too... the following quote from the patent seems to say the same thing, although it appears to contain a typo.

[0221] The foregoing approaches to identifying a spam document are preferably implemented as part of the indexing process, and may be conducted in parallel with other indexing operations, are afterwards.

Say what? It only makes sense to me if I read the last phrase as "or afterwards."

At any rate, I don't assume that the 950 phenomenon can be wholly explained by phrase based techniques. The kind of impact on ranking to be expected is highlighted by two examples in the spam patent - and to my understanding, neither of these two steps would send every tagged url to the end of results. Of course, nothing in the patent requires that only these two steps are possible.

[0223] If the document is included in the SPAM_TABLE, then the document's relevance score is down weighted by predetermined factor. For example, the relevance score can be divided by factor (e.g., 5). Alternatively, the document can simply be removed from the result set entirely.

However, the frequently mentioned "over optimization penalty" or OOP does seem that it could be accounted for with these approaches.

[0220] ...The document is also added as a spam document for each the related phrases of that good phrase, since a document is considered a spam document with respect to all phrases that are related to each other.

Note that this does not seem to be what 950 sufferers are describing. For at least some of these cases, related phrases still can rank well.

Oliver Henniges




msg:3255767
 12:00 am on Feb 18, 2007 (gmt 0)

> Say what? It only makes sense to me if I read the last phrase as "or afterwards."

Yes, indeed, the patent paper seems to be written a bit hastily all in all, but who am I to judge Google's patents.

As I mentioned in [webmasterworld.com ] I got stuck between 0036 and 0039 and no one has yet helped me over that hurdle. To me, a computer program written according to this description would get stuck in an infinite loop, but I may be wrong.

Marcia




msg:3255821
 1:46 am on Feb 18, 2007 (gmt 0)

>>would get stuck in an infinite loop

It wouldn't, because it's just incrementing occurrence counts based on an if/then decision and writing to a file.

tedster




msg:3255859
 2:47 am on Feb 18, 2007 (gmt 0)

I think I do see a logical problem there, Oliver, but not an infinite loop. The following is what looks like a contradiction to me: (Note that 'bad' here means 'lacking in predictive power'.)

[0036] In each phrase window 302, each candidate phrase is checked in turn to determine if it is already present in the good phrase list 208 or the possible phrase list 206. If the candidate phrase is not present in either the good phrase list 208 or the possible phrase list 206, then the candidate has already been determined to be "bad" and is skipped.

[0039] If the candidate phrase is not in the good phrase list 208 then it is added to the possible phrase list 206, unless it is already present therein.

In [0036] it sounds like no new 'good' phrases can ever be added. Then [0039] seems to contradict that. But this must be because of the poorly written "plain English" patent language. If the 'good' and 'possible' phrase lists really stayed empty, someone would notice.
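
For what it's worth, the two paragraphs stop contradicting each other if you assume the indexer also remembers phrases it has already evaluated and rejected - a detail the "plain English" wording may simply have dropped. A minimal sketch of that reading (the explicit bad_phrases set is my assumption, not something the patent text spells out):

```python
good_phrases = {}      # phrase -> occurrence count
possible_phrases = {}  # phrase -> occurrence count, not yet promoted to good
bad_phrases = set()    # assumed: phrases already evaluated and rejected

def process_candidate(phrase):
    # [0036]: skip candidates already determined to be 'bad'
    if phrase in bad_phrases:
        return
    if phrase in good_phrases:
        good_phrases[phrase] += 1
    elif phrase in possible_phrases:
        possible_phrases[phrase] += 1
    else:
        # [0039]: a genuinely new candidate goes onto the possible phrase list
        possible_phrases[phrase] = 1
```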

But this is all in the preliminary stage of identifying 'good' and 'bad' phrases, so I just let it pass and assumed poor editing and/or proofreading. I'm very willing to grant that a solid list of related phrases is built. What interests me more is how that list of 'good' phrases and the documents where they occur is now put to use.

This patented process for spam detection is looking for excessive numbers of related phrases (scraping a top 30 list to create a patchwork page could create that condition). It's also looking for excessive occurrences of any one of the 'good' phrases - stuffing, in other words.

The thing is that phrase based processing can also be used simply to rank honest documents for relevance to the search phrase. The way I understand it, spam documents identified by this process should be way over the top -- not just a little bit more intense than an honest document.

[edited by: tedster at 3:15 am (utc) on Feb. 18, 2007]

annej




msg:3255878
 3:11 am on Feb 18, 2007 (gmt 0)

what is "acceptable" is still something that "significantly exceeds the expected number".

I have a hard time writing articles without repeating "widgets", "widget" and "widgeting" over and over. I also have difficulty not repeating the name of the kind of widgeting several times. So I guess I do significantly exceed in some cases. In one article I've been unable to reduce the phrase any more, as if I did it simply wouldn't make sense.

I would imagine that "thin" pages would scoot right by this test, whereas article pages written with certain target searches in mind might trip the spam test. Do others see it this way?

The pages I've had problems with are pages that took days if not weeks of research and writing to put together. Meanwhile I have a section of "Widget Notes" where I toss up any little announcement, interesting tidbit or whatever. Each little item is on a separate page. There is more chit chat in them and some are only one paragraph long. They are doing great in the serps! No penalties at all. So my "thin" pages are doing better than my scholarly pages.

Oliver Henniges




msg:3256230
 7:01 pm on Feb 18, 2007 (gmt 0)

Today I tried to understand "Phrase identification in an information retrieval system," which seems somewhat basic to the spam-patent.

I believe that the "good phrase list" is NOT computed in the running application of this patent as described: it had been compiled somewhere else beforehand.

Maybe this bears interesting consequences for SEO:

Under what circumstances will new phrases make it to the list? Can you regain some limited control over the algo by artificially helping certain phrases over this threshold?

The list is not compiled the way the patent describes (though the figures given there might be helpful), but neither can it be static, because otherwise new topics would never be recognized, and I'd really be surprised to hear that spammers don't target keywords currently in the news.

An alternative keyhole through which to regain influence might be the asymmetry of the co-occurrence matrix, which the algo produces after zeroing out those pairs of related phrases which stay below the information-gain threshold. This asymmetry flows directly into the data of the phrase clusters compiled later:

[0104] For example, assume the good phrase "Bill Clinton" is related to the phrases "President", "Monica Lewinsky", because the information gain of each of these phrases with respect to "Bill Clinton" exceeds the Related Phrase threshold. Further assume that the phrase "Monica Lewinsky" is related to the phrase "purse designer". These phrases then form the set R. To determine the clusters, the indexing system 110 evaluates the information gain of each of these phrases to the others by determining their corresponding information gains. Thus, the indexing system 110 determines the information gain I("President", "Monica Lewinsky"), I("President", "purse designer"), and so forth, for all pairs in R. In this example, "Bill Clinton," "President", and "Monica Lewinsky" form a one cluster, "Bill Clinton," and "President" form a second cluster, and "Monica Lewinsky" and "purse designer" form a third cluster, and "Monica Lewinsky", "Bill Clinton," and "purse designer" form a fourth cluster. This is because while "Bill Clinton" does not predict "purse designer" with sufficient information gain, "Monica Lewinsky" does predict both of these phrases.

I could imagine that areas exist where only ten or twenty pages containing "Bill Clinton" AND "purse designer" WITHOUT "Monica Lewinsky" might force Google to reevaluate this cluster matrix, though of course not for this particular example. The patent does not say that violations of this asymmetry would trigger a filter; co-occurrences below the threshold are simply deleted.

I'd speculate that the AdWords keyword suggestion tool provides interesting data for an analysis of this asymmetry.
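
To make the [0104] example quoted above a bit more concrete, here is a toy sketch of that asymmetric relation. The gain figures and the threshold are invented; only their position relative to the threshold mirrors the example:

```python
THRESHOLD = 100.0  # the patent's Related Phrase threshold; the value is a guess

# Invented, asymmetric information gains I(A -> B): A predicts B if the gain
# reaches the threshold. Note I(A -> B) need not equal I(B -> A).
gain = {
    ("Bill Clinton", "President"): 250, ("President", "Bill Clinton"): 220,
    ("Bill Clinton", "Monica Lewinsky"): 180, ("Monica Lewinsky", "Bill Clinton"): 160,
    ("Monica Lewinsky", "purse designer"): 140, ("purse designer", "Monica Lewinsky"): 130,
    ("President", "Monica Lewinsky"): 110, ("Monica Lewinsky", "President"): 105,
    # "Bill Clinton" and "President" do NOT predict "purse designer":
    ("Bill Clinton", "purse designer"): 20, ("purse designer", "Bill Clinton"): 15,
    ("President", "purse designer"): 10, ("purse designer", "President"): 5,
}

def predicts(a, b):
    return gain.get((a, b), 0) >= THRESHOLD

phrases = ["Bill Clinton", "President", "Monica Lewinsky", "purse designer"]
for p in phrases:
    print(p, "->", [q for q in phrases if q != p and predicts(p, q)])

# "Monica Lewinsky" predicts both "Bill Clinton" and "purse designer", while
# "Bill Clinton" does not predict "purse designer" - which is why the four
# phrases end up in several overlapping clusters rather than a single one.
```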

thegypsy




msg:3256263
 7:51 pm on Feb 18, 2007 (gmt 0)

Interesting view... for me it's always 'the dials' that make a dog's breakfast of things. With so many variables that have no stated definitive value, we're grasping at straws (as usual).

I tend to look at things from the engineers side of things, in that 'what could I do with this Bad Boy' - from that perspective there are more ways to 'play with the dials' than one could ever dream of.

You mention AW data... how about all of the combined GTB (G ToolBar) and Personalized Search data? You get the idea.... grasping at straws again....

I was on the 'Latent Semantic Analysis' bandwagon a year ago... so I have a healthy respect for misdirection/misconception with all of the PaIR stuff....

Strangely, this journey has led me to Personalized Search studies (thanks Brett... look what U started)... I can already (with preliminary research) see where the PaIR model 'could' also come into play in data set refinements with Personalized Search Optimization (PSO?)

... well there's a few more musings for a Sunday afternoon....

jimbeetle




msg:3256291
 8:32 pm on Feb 18, 2007 (gmt 0)

You mention AW data... how about all of the combined GTB (G ToolBar) and Personalized Search data?

Exactly. I assume Google has a discrete corpus of phrases for each product channel. I would again assume that G would put all of these corpora to use.

... grasping at straws again

As always, so here goes. I can even see G giving weights to each corpus. Something along the lines of phrases derived from personalized search data are weighted more than phrases derived from general search, that, in turn, are weighted more than phrases plucked from the AdWords corpus, etc.

Don't know where that leads, just Sunday afternoon grasping.

Oliver Henniges




msg:3256299
 8:54 pm on Feb 18, 2007 (gmt 0)

> You mention AW data... how about all of the combined GTB (G ToolBar) and Personalized Search data?

Of course the data gained by phrase-based IR is only one part of the story, but you wanted to go deeper, and that means discussing details of the patents, which always helps to refine one's understanding. The patents you thankfully linked to concentrate on this phrase-based-indexing stuff. Big Daddy's internal relationships to other research areas, like those you mentioned, are also in part described in the patents.
