
Forum Library, Charter, Moderators: Robert Charlton & aakk9999 & brotherhood of lan & goodroi

Google SEO News and Discussion Forum

This 75 message thread spans 3 pages.
"Phrase Based Indexing and Retrieval" - part of the Google picture?
6 patents worth
thegypsy




msg:3247209
 5:28 am on Feb 9, 2007 (gmt 0)

I just noticed the thread on relationships of 'Phrase Based' layering in the -Whatever penalties.

This is interesting. That thread seems to be moving in a different direction so I started this for ONE simple area - Phrase Based Indexing and Retrieval (I call it PaIR to make life easier)

There is MORE to this than the thoughts Ted started towards, as far as -30 type penalties. I have trudged through 5 of the PaIR related patents from the last year or so and written 3 articles and ONE conspiracy theory on the topic.

Among the more recent inferences was a conspiracy theory involving the recent GoogleBomb-defusing affair.

Specifically, from the patent Phrase identification in an information retrieval system [appft1.uspto.gov]:

"[0152] This approach has the benefit of entirely preventing certain types of manipulations of web pages (a class of documents) in order to skew the results of a search. Search engines that use a ranking algorithm that relies on the number of links that point to a given document in order to rank that document can be "bombed" by artificially creating a large number of pages with a given anchor text which then point to a desired page. As a result, when a search query using the anchor text is entered, the desired page is typically returned, even if in fact this page has little or nothing to do with the anchor text. Importing the related bit vector from a target document URL1 into the phrase A related phrase bit vector for document URL0 eliminates the reliance of the search system on just the relationship of phrase A in URL0 pointing to URL1 as an indicator of significance of URL1 to the anchor text phrase.

[0153] Each phrase in the index 150 is also given a phrase number, based on its frequency of occurrence in the corpus. The more common the phrase, the lower the phrase number it receives in the index. The indexing system 110 then sorts 506 all of the posting lists in the index 150 in declining order according to the number of documents listed in each posting list, so that the most frequently occurring phrases are listed first. The phrase number can then be used to look up a particular phrase. "
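To make paragraph [0153] concrete, here is a toy sketch of that indexing step: phrases get a number based on corpus frequency, and posting lists are sorted most-common-first. All function and variable names here are my own, not the patent's.

```python
def build_phrase_index(docs_by_phrase):
    """docs_by_phrase: {phrase: set of doc ids containing that phrase}."""
    # Sort by document count, descending, so the most common phrase comes first
    ordered = sorted(docs_by_phrase.items(),
                     key=lambda kv: len(kv[1]), reverse=True)
    # The more common the phrase, the lower the phrase number it receives
    phrase_number = {phrase: n for n, (phrase, _) in enumerate(ordered)}
    posting_lists = [(phrase, sorted(ids)) for phrase, ids in ordered]
    return phrase_number, posting_lists

numbers, postings = build_phrase_index({
    "baseball cards": {1, 2, 3, 4},
    "vintage baseball cards": {2, 3},
    "senate floor": {5},
})
# numbers["baseball cards"] == 0 (most common phrase, lowest number)
```

The phrase number then doubles as a compact lookup key, which is presumably why the patent bothers assigning it at all.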

Call me a whacked out conspiracy theorist, but I think we could have something here. Is it outright evidence that Google has migrated to a PaIR based model? Of course not. I would surmise that it is simply another layer that has been laid over the existing system, and that the last major infrastructure update (the dreaded BigDaddy) facilitated it. But that's just me.

I am curious about complementary/contrary theories as mentioned by Ted in the other "Phrase Based Optimization" thread. I simply wanted to keep a clean PaIR discussion.

For those looking to get a background in PaIR methods, links to all 5 patents:

Phrase-based searching in an information retrieval system [appft1.uspto.gov]

Multiple index based information retrieval system [appft1.uspto.gov]

Phrase-based generation of document descriptions [appft1.uspto.gov]

Phrase identification in an information retrieval system [appft1.uspto.gov]

Detecting spam documents in a phrase based information retrieval system [appft1.uspto.gov]

I would post snippets, but it is a TON of research (many groggy hours). I felt posting WHAT "Phrase Based Indexing and Retrieval" is would also dilute the intended direction of the thread, which is to potentially stitch together this and the suspicions of PaIR being at work in the -whatever penalties... more evidence that it is being implemented.

Note: There is a sixth Phrase-based patent:
Phrase identification in an information retrieval system [appft1.uspto.gov]

[edited by: tedster at 6:59 am (utc) on May 14, 2007]

 

annej




msg:3247683
 4:55 pm on Feb 9, 2007 (gmt 0)

Thanks for starting this thread and linking all of the patents.

Here is what I've observed that may be related to this. It's a simplified version and might help people see what we are talking about.

Google has some way of isolating certain phrases that they deem typical of spam sites.

Some of us have pages that have too many of these flag phrases.

But the number of phrases that can cause the penalty depends on other factors.

It appears that these problem phrases do more damage if they are in the page title and possibly the H1 tags.

These phrases are more damaging if they are in inbound or internal linking anchor text.

Strength of inbound links is computed in so the level of problem phrases that hurt a page might be different depending on inbounds.

Google is constantly adjusting this filter so some pages are going in and out of it.

Many of the sites affected by this are well established sites that have all pages but the missing ones ranking quite well. Some of these missing pages had been in the top ten results for years.

It appears to me that pages with few inbound links can be hurt by scrapers as most of the link anchor text will be identical. This give the appearance that a lot of inbound links have the same phrases in the anchor text.

(Let's hear what others have observed)

thegypsy




msg:3247735
 5:55 pm on Feb 9, 2007 (gmt 0)

You see, Spam detection is merely ONE aspect of the system. This is my interest. Early last year I was a Latent Semantic Indexing (Analysis) fan, but it's limited in what it can accomplish compared to the PaIR model..

The documents deal with:

Indexing
Weighting/Ranking
Duplicate content
Spam detection and weighting (yes - not all spam is bad?)
Back Links/Link Profiles
Personalized Search

And much more. It is NOT a specific fix, it is more of an all encompassing IR method with many NEW flags/filters to add onto the current infrastructure.

As far as spam goes, though - some info:


Some spam pages are documents that have little if any meaningful content, but instead comprise collections of popular words and phrases, often hundreds or even thousands of them; these pages are sometimes called "keyword stuffing pages." Others include specific words and phrases known to be of interest to advertisers. These types of documents (often called "honeypots") are created to cause search engines to retrieve such documents for display along with paid advertisements...

A normal, non-spam document will generally have a relatively limited number of related phrases, typically on the order of between 8 and 20, depending on the document collection. By contrast, a spam document will have an excessive number of related phrases, for example on the order of between 100 and 1000 related phrases. Thus, the present invention takes advantage of this discovery by identifying as spam documents those documents that have a statistically significant deviation in the number of related phrases relative to an expected number of related phrases for documents in the document collection.

Google's Spam Patent [appft1.uspto.gov]
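The patent excerpt above reduces to a simple heuristic: flag documents whose related-phrase count deviates by a statistically significant amount from the collection's norm. A toy sketch, with thresholds and names that are entirely my own guesses:

```python
import statistics

def flag_spam_candidates(related_counts, z_cutoff=3.0):
    """related_counts: {doc_id: number of related phrases found in the doc}."""
    counts = list(related_counts.values())
    mean = statistics.mean(counts)
    stdev = statistics.pstdev(counts) or 1.0  # guard against zero deviation
    # Flag only the high-side outliers: far MORE related phrases than expected
    return {doc for doc, n in related_counts.items()
            if (n - mean) / stdev > z_cutoff}

docs = {f"doc{i}": 12 for i in range(50)}  # normal pages: ~8-20 related phrases
docs["stuffed-page"] = 600                 # keyword-stuffed page, per the patent's 100-1000 range
print(flag_spam_candidates(docs))          # → {'stuffed-page'}
```

A z-score is just one way to operationalize "statistically significant deviation"; the patent leaves the exact measure open.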

So, logically/statistically, using predictive measures with "good phrases" can do many things.

Once again, the comprehensive nature of the technology over a simplified model such as LSI is obvious.

I started writing about it last fall - and have generally met with blank stares, heh heh.. so I am quite interested in the discussion; it's been a long time coming for me.

[edited by: tedster at 6:40 pm (utc) on Feb. 17, 2007]
[edit reason] attribute quotations [/edit]

jk3210




msg:3247958
 9:04 pm on Feb 9, 2007 (gmt 0)

The goal being to classify each potential phrase as either "a good phrase or a bad phrase" depending on its usage and frequency; then using those 'good' phrases in predicting the usage of other 'good phrases' in the collection of web pages.

Okay, so if the predicted "other good phrases" are not present, then what?...the page is a spam suspect? What's the purpose of this little exercise they're doing?

jimbeetle




msg:3248010
 9:58 pm on Feb 9, 2007 (gmt 0)

thegypsy, just as a point of information and to clarify, what is the source of that long quote in msg #3247735. Is it from a write-up of yours or another source?

I'm particularly interested in this part:

According to the folks that drafted it, a normal related, topical phrase occurrence (or related phrases) is in the order of 8-20 whereas the typical Spam document would contain between 100-1000 related phrases.

This really helps me understand the approach as an entirety. Whose words and numbers are they?

annej




msg:3248146
 12:44 am on Feb 10, 2007 (gmt 0)

Someone correct me if I'm wrong but my impression is that a 'good phrase' means it's a good predictor of spam. So from our point of view a good phrase is a bad phrase to have too many times on our webpage. ;)

I don't think it's just the number of times a 'good phrase' is on a document or in combination with a related 'good phrase'. I think where the phrase is placed in the document or linking documents is also considered. As I mentioned above it seems page titles and anchor text are weighted more. This is from comparing pages that are still doing well with pages that have plunged.

Do any of the patent documents indicate this?

tedster




msg:3248151
 1:10 am on Feb 10, 2007 (gmt 0)

From the Spam detection [webmasterworld.com] patent

[0027] For example, the phrase "President of the United States" is a phrase that predicts other phrases such as "George Bush" and "Bill Clinton." However, other phrases are not predictive, such as "fell down the stairs" or "top of the morning," "out of the blue," since idioms and colloquialisms like these tend to appear with many other different and unrelated phrases. Thus, the phrase identification phase determines which phrases are good phrases and which are bad (i.e., lacking in predictive power).

The term "good phrase" appears very early on in the process in step [0008], long before the spam detection parts. I read it as saying that a "good phrase" is one that can be used as a relevance indicator for the search phrase. Of course a spam page will target such good phrases, but so will any good result for the search as well.
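One rough way to picture "predictive power" as described in paragraph [0027]: phrase A predicts phrase B when they co-occur far more often than chance would suggest. The lift-style ratio below is my simplification, not the patent's actual information-gain formula, and the corpus numbers are invented:

```python
def predictive_power(docs_with_a, docs_with_b, docs_with_both, total_docs):
    # Expected co-occurrences if phrases A and B were statistically independent
    expected = docs_with_a * docs_with_b / total_docs
    return docs_with_both / expected if expected else 0.0

# Toy corpus: "President of the United States" vs "George Bush"
gain = predictive_power(docs_with_a=1000, docs_with_b=800,
                        docs_with_both=600, total_docs=1_000_000)
# gain far above 1 means A strongly predicts B; an idiom like
# "out of the blue" would score near 1 against unrelated phrases.
```

Under this reading, an idiom fails not because it is rare but because its co-occurrence pattern is indistinguishable from chance.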

MHes




msg:3248181
 1:58 am on Feb 10, 2007 (gmt 0)

I suspect that phrase density can be very high, maybe 90% but still be deemed relevant. For example, a recruitment page about 'Telesales jobs in London' may have the phrase 50 times as it defines each job advertised and this could be a very good page. If the phrase density is very high then google will look for the associated phrases in pages clustered with that page. If the clustered page has the phrase in a low density, e.g. on a specific job description page, then the associated phrases should be on that page. If they are, then the page with a high density which is clustered with it, is still deemed to be OK.

Therefore the associated phrases need not be on the same page, but in the cluster of pages and the overall density and frequency valued over the whole cluster.

This brings up a subtle point that google may be addressing. In the case of recruitment, the top level navigation page may have an extremely high frequency/density of a phrase. Each reference to the phrase links to an actual job details page. Often the job may have expired within a day or two, but google will still have the cache for maybe another 7 days. Therefore it would rather offer the high density 'navigation' page than the actual job detail page which may have expired. It will do this if the associated phrases were found on the clustered job details pages, and then dump them in preference for the more reliable top level navigation page.

Thus I think the patent describes the process on a page basis, but it is applied on a cluster basis. It's all about navigation and offering the user the best entry point into a website, rather than automatically offering the highest-scoring page, which may be out of date.
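One way to read the cluster idea in the post above: related-phrase support is evaluated over a set of linked pages rather than over one page in isolation. This is pure speculation on my part, and every name and threshold below is invented:

```python
def cluster_supports(pages, related_phrases, min_hits=2):
    """pages: texts of pages in one cluster; related_phrases: phrases we expect to find."""
    combined = " ".join(page.lower() for page in pages)
    hits = sum(1 for phrase in related_phrases if phrase.lower() in combined)
    return hits >= min_hits

# A high-density 'Telesales jobs in London' listing page is vouched for
# by a low-density job-description page in the same cluster:
cluster = [
    "Telesales jobs in London - Telesales jobs in London - apply now",
    "Job description: outbound telesales role, central London office",
]
print(cluster_supports(cluster, {"telesales", "london", "job description"}))
```

If something like this is in play, a high-density page survives as long as its neighbours supply the associated phrases it lacks.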

annej




msg:3248309
 6:29 am on Feb 10, 2007 (gmt 0)

I've been finding some of my missing pages, then doing the "repeat with omitted results" option. Then I print out the cluster. As a result I can see which pages are clustered way down there at 996 or whatever. At first I thought the pages in the cluster were all penalized or filtered out. But now I'm seeing some pages that do just fine in the serps if their topic is searched.

Sometimes the pages in the cluster are in the same section as the problem page is. Sometimes they just have words or phrases in common.

I've been trying to sort out how this information could help me and find I'm just getting more confused. Could someone explain clustering and in what way looking at it could help me?

Lexur




msg:3248320
 6:52 am on Feb 10, 2007 (gmt 0)

I think it could be a nice method (checking "good phrases" in page clusters) to detect spam, because Google engineers can detect a bunch of spam related terms.

For those of us interested not in spam prevention but in rankings: do you think this model can be used with a few billion pages, a zillion terms, and a few dozen languages?

[edited by: Lexur at 6:58 am (utc) on Feb. 10, 2007]

MHes




msg:3248354
 9:08 am on Feb 10, 2007 (gmt 0)

Lexur - Google has been using semantics for a long time. Remember the ~widget - widget search that used to show broad match terms? If Google took several pages' content (a cluster) and treated it as one 'document', then ran their semantics process extended to look for 'phrases', they could probably do this at run time. They would then divide and rank the clusters according to which pages bonded according to predictive phrase analysis. I'm also suspicious that they choose the final ranking pages according to user navigation in order to lower the chance of offering an out of date page. They would never have to do this with millions of pages, just the top fifty for a search phrase.

[edited by: MHes at 9:10 am (utc) on Feb. 10, 2007]

thegypsy




msg:3248487
 2:26 pm on Feb 10, 2007 (gmt 0)

I would like to reiterate that SPAM DETECTION is merely ONE facet of the system. There is a single patent alone JUST for 'Spam Detection' in a PaIR system.

It is NOT a singular focused layer. It is, or could be, a standalone IR system. Is it? Highly unlikely, as we would have seen it come in.

I suspect it has merely been added to existing infrastructure. If you REALLY want to play conspiracy theorist, you could surmise Google used the whole 'GoogleBomb' announcement to 'turn up the dials' on the PaIR influences on the system in an attempt to keep it 'under the radar'.

It is the method of the 'Indexing and retrieval' that is at the core... secondary data sets are then developed from there to deal with issues such as

Weighting/Ranking
Duplicate content
Spam detection and weighting
Links and link profiles
Snippets (page descriptions)
and so on.....

This is where LSA technologies break down

SO ONCE AGAIN.. it is NOT limited to mere Spam detection...

thegypsy




msg:3248493
 2:40 pm on Feb 10, 2007 (gmt 0)


thegypsy, just as a point of information and to clarify, what is the source of that long quote in msg #3247735. Is it from a write-up of yours or another source?
I'm particularly interested in this part:

According to the folks that drafted it, a normal related, topical phrase occurrence (or related phrases) is in the order of 8-20 whereas the typical Spam document would contain between 100-1000 related phrases.

This really helps me understand the approach as an entirety. Whose words and numbers are they?

That's from the articles I wrote on it. A more simplified breakdown....

SNIPPET

As you (undoubtedly) remember, the core concept of the processing is to identify valid (actual/real) phrases in a given document collection (or web pages in our case). The goal being to classify each potential phrase as either "a good phrase or a bad phrase" depending on its usage and frequency; then using those "good" phrases in predicting the usage of other "good phrases" in the collection of web pages.

What's a "Good Phrase"?

A possible phrase is classified as either a good phrase or a bad phrase based on whether it "appears in a minimum number of documents, and appear[s] a minimum number of instances in the document collection". What those numbers are, we don't know. Those are the "dials" the Search Gods themselves only have access to. It is almost like looking at a phrase density over the aggregate of documents (the web site). Also, a BAD phrase is not one with dirty words; it is simply a phrase with too low a frequency count to make the "good" list.
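A minimal sketch of the good/bad split just described: a phrase is "good" only if it clears minimum document and instance counts. The real thresholds (the "dials") are unknown, so the values below are placeholders:

```python
def classify_phrases(phrase_stats, min_docs=10, min_instances=20):
    """phrase_stats: {phrase: (doc_count, total_instances in the collection)}."""
    good, bad = set(), set()
    for phrase, (doc_count, instances) in phrase_stats.items():
        if doc_count >= min_docs and instances >= min_instances:
            good.add(phrase)
        else:
            bad.add(phrase)  # "bad" = too rare to be useful, not dirty words
    return good, bad

good, bad = classify_phrases({
    "blue widgets": (120, 340),          # frequent enough -> good
    "fell down the stairs zxq": (2, 3),  # too rare -> bad
})
```

Note the asymmetry with everyday usage: "bad" here carries no moral weight at all; it just means the phrase has no statistical value.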

[edited by: tedster at 3:45 pm (utc) on Feb. 10, 2007]

jimbeetle




msg:3248570
 4:15 pm on Feb 10, 2007 (gmt 0)

Thanks for the clarification, thegypsy.

TheWhippinpost




msg:3248820
 9:22 pm on Feb 10, 2007 (gmt 0)

Once again, the comprehensive nature of the technology over a simplified model such as LSI is obvious.

I too have expected something along the lines of this phrase theory, though I see it as an extension to an LSI-type algo.

Whereas LSI essentially talks about synonyms of words, this almost lends itself to "synonyms of phrases"

If you were to compare the pages of a product tutorial, a product review, and a typical legit ecommerce product page (we'll assume it's a tech kind of product here), you would most likely see a far higher density of technical language being used in the merchants page, than either of the others.

What's more, the proximity, i.e. the "distance", between each of those technical words is most likely to be far closer on the merchant's page too (think product specification lists etc...).

Tutorial pages will have a higher incidence of "how" and "why" types of words and phrases.

Reviews will have more qualitative and experiential types of words ('... I found this to be robust and durable and was pleasantly surprised...').

Sales pages similarly have their own (obvious) characteristics.

Mass-generated spammy pages that rely on scraping and mashing-up content to avoid dupe filters whilst seeding in the all-important link-text (with "buy" words) etc... should, in theory, stand-out amongst the above, since the spam will likely draw from a mixture of all the above, in the wrong proportions.

Therefore the associated phrases need not be on the same page, but in the cluster of pages and the overall density and frequency valued over the whole cluster.

Most definitely.

jimbeetle




msg:3248884
 11:12 pm on Feb 10, 2007 (gmt 0)

Whereas LSI essentially talks about synonyms of words, this almost lends itself to "synonyms of phrases"

I think "synonyms of phrases" gets a bit too far away from the idea. We might be able to get away with saying "related phrases" (because in a way they are), but even that is not quite illustrative of the presence of Phrase A predicts the probability of the presence of Phrase B.

Therefore the associated phrases need not be on the same page, but in the cluster of pages and the overall density and frequency valued over the whole cluster.

This and other points MHes raised are very interesting. Using all the usual possible/maybe/it feels like caveats, this can possibly explain folks comments in other threads of taking hits on pages in one directory. Assuming pages in a single directory are somewhat related and maybe interlinked, that would -- or can -- fit MHes's definition of a "cluster of pages".

jk3210




msg:3248893
 11:32 pm on Feb 10, 2007 (gmt 0)

Assuming pages in a single directory are somewhat related and maybe interlinked, that would -- or can -- fit MHes's definition of a "cluster of pages".

For a 3-word "cityname widgets" search, 43 of my pages are grouped at the bottom of the last serps page, mostly ones from the same directory and all interlinked. The pages in the group of 43 that AREN'T in that one directory are up the directory tree and link to those pages.

Yet, the main page that all those individual pages link to is still #1.

annej




msg:3248897
 11:36 pm on Feb 10, 2007 (gmt 0)

I find the idea that Google may be looking at related phrases in a cluster of pages has interesting possibilities.

Assuming pages in a single directory are somewhat related and maybe interlinked, that would -- or can -- fit MHes's definition of a "cluster of pages".

In the case where a good inbound link appeared to bring back one of my contents pages within a couple of days I also got back 3 pages that were linked from that contents page. But one page remains missing.

I'm thinking the missing page had too many other "good" phrase matches so the return of the Contents page wasn't enough.

In another section of my site the whole topic is gone, contents page and article pages. Two other sections of my site just have one or two pages missing and there seems to be no association with the contents pages.

TheWhippinpost




msg:3248951
 2:02 am on Feb 11, 2007 (gmt 0)

I think "synonyms of phrases" gets a bit too far away from the idea.

Hence why I quoted it, Jim... I used "synonym" because that's how people generally began to describe the words LSI outputted, when in fact it's more about discovering unique words that are most likely to be found together (a very simple summary, obviously); these aren't synonyms you necessarily find in the dictionary.

So you end up coming out of that process with a few words that one could expect to see when flirting around another particular word... a related word(s), ya might say.

... or a predictive word; or even an expected word!

I've played a lot with the tilde (~) operator over time and know the algo can "relate" a brand-name to a manufacturer-name. It has also learnt to relate acronyms to it, as well as others.

To expand that to encompass a series of words, instead of just one, would be just a maths and computational exercise, I would'a thought... and we all know about BD.

Assuming pages in a single directory are somewhat related and maybe interlinked, that would -- or can -- fit MHes's definition of a "cluster of pages".

I'd go further than that: The calculation will be made across a cluster of documents it has already judged to fall within the area of interest - which might even be the number of documents it says it found on the SERP.

This won't be just your site, or directory, though clearly if you have a good focussed directory, it would/could figure more dominantly than the opposite case.

[edited by: TheWhippinpost at 2:03 am (utc) on Feb. 11, 2007]

thegypsy




msg:3249064
 6:49 am on Feb 11, 2007 (gmt 0)

Whipping post – strangely it is NOT an addition to any LSA/I technologies. I dare say we may have missed the boat and it never left AdSense/AdWords. This is a standalone method that is far more comprehensive than LSI …. Jim seems to be getting the idea….

Tho following the trail of LSA/I last year did bring me to this.

To keep things moving along, a snippet from one of the articles – it delves into term extensions: connecting words that create phrasings, and the PaIR basic model for identification.


Phrase Extensions and identification
Phrase extensions are merely additional words on the core term(s). If we had the core term ‘Baseball Cards’ we could ‘extend’ it with ‘Vintage Baseball Cards’, ‘Buy Vintage Baseball Cards’ and finally ‘Buy Vintage Baseball Cards Online’ – you get the idea.
To identify a potential phrase the algo looks at a phrase such as "Hillary Rodham Clinton Bill on the Senate Floor", from which it would take; "Hillary Rodham Clinton Bill on," "Hillary Rodham Clinton Bill," and "Hillary Rodham Clinton". Only the last one is kept. It would also identify "Bill on the Senate Floor" and the inferences down to ‘bill’.

And scoring/ranking
In the end it is these related phrase/theme scores that are used in the ranking of documents for a given search query. The documents with the most related phrases and secondary related phrases for the query phrases are ranked highest. The semantically topical, relevant page gets the highest ranking.

How about backlinks?
Related query phrases in the anchor text of links to other documents are also scored. There are 2 scores here: the 'body' score and the 'anchor' score. Greater scoring is obviously given if a good phrase is in the text link as well as in the body of the referenced document. Additionally, the anchor text TO your site is also analyzed and scored under the same methods.

Once again, the PaIR model is FAR more comprehensive in its abilities than the LSI model. Was LSA/I used in the organic SERPs since 2003 (when G purchased Applied Semantics)? Maybe. If this is part of the 'new' world, it is one hell of an upgrade….
…and deeper we go….
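The "Hillary Rodham Clinton" example in the snippet above suggests a simple pruning rule: from a candidate window, keep only the longest prefix that is a known good phrase. This sketch is entirely illustrative; the good-phrase list is hand-made, and the real system's candidate generation is far richer:

```python
GOOD_PHRASES = {"hillary rodham clinton", "bill on the senate floor"}

def longest_good_prefix(words, good=GOOD_PHRASES):
    # Try the longest candidate first, shrinking the window word by word;
    # return the first (i.e. longest) prefix that is a known good phrase.
    for end in range(len(words), 0, -1):
        candidate = " ".join(words[:end]).lower()
        if candidate in good:
            return candidate
    return None

print(longest_good_prefix("Hillary Rodham Clinton Bill on".split()))
# → hillary rodham clinton
```

Running the same scan from other start positions would pick up "Bill on the Senate Floor" and the shorter inferences down to 'bill', matching the snippet's description.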

MHes




msg:3249138
 10:29 am on Feb 11, 2007 (gmt 0)

There's an assumption that 'missing pages', or ones sent to 950+, have been totally rejected in Google's eyes for 'relevancy'. I think this may not always be the case; those pages may in some circumstances have played an important part in helping another page rank top. There are too many reports saying that a page was totally relevant but Google chose to rank a different page. So I think the assumption that Google has failed to appreciate the relevancy of a page by putting it 950+ needs to be readdressed. Google may have been very impressed by the page, but it was not deemed a good page to send the visitor to, for reasons beyond the page's relevancy to the search query. It could be argued that the page was subjected to 950 precisely because it was a candidate for a top ranking... so close but so far.

Pages may go 950+ for several positive reasons as well as negative ones:

1) They have been put into the potential 'spam index' and no other page within their cluster has shown sufficient 'predictive phrases' to support them. Nothing wrong with the page itself, but with the pages around it. Hence, that page could rank top for another search phrase that does have the right fit of pages around it.
2) They are a poor landing page for a visitor. Predictive phrases were found in the cluster of pages but this page does not 'navigate' well within the cluster. Therefore a visitor would not be able to navigate between the pages that google saw as being supportive.

In a lateral way of thinking, let's suppose a spider is a person who cannot speak English. They are looking for documents with the words 'widget' and 'blue'. They find 4 documents in one filing cabinet (the website) with these words on them. The spider takes copies of the 4 documents as potential pages for ranking and goes to its desk. The spider also knows the phrase 'auto mechanical' is an associated phrase, and this also appears on one of the pages. This page is looking good; let's call it page A. Then the spider notices the phrase 'auto mechanical' is also on another page (page B), but only mentioned once, and this links to page A. It can only choose two pages, but page C is also looking good and has the words 'widget' and 'blue' all over it. So although page A is looking the best on a 'page basis', page B is the best from a navigation basis within the cluster. It therefore chooses pages B and C, because a user can go to page A from B and will see page C in the serps as an indent. It thus covers maximum ground with these two pages and ends up throwing the apparently best page into the bin. In this example, page B was chosen because page A was best and the 'predictive phrase' supported the link between the two pages. Page D was totally rejected because there was no direct link from any other page in the cluster.

So, the best page can be put into the bin ( or 950+).... but it is still the best 'on page' match and valued as such by Google. If page B had not linked to page A, then all four pages would have been relevant in isolation. This is not a helpful scenario for the user and brings into doubt the overall 'theme' of the collection of pages. The rule may be, if the webmaster has made no connection between the pages, then there is no connection between the pages for this search query.... they all go 950+. The important bit may also be how the pages link together. Page A may have linked to all the pages via a navigation template or via links with unrelated words. This may cause the link to be ignored for a specific search phrase.

Now add a further 100 factors into the analysis of each page and find the exceptions to the rule!

annej




msg:3249285
 4:23 pm on Feb 11, 2007 (gmt 0)

In frustration I stripped a penalized contents page down to just a few links with one-word anchor text. The logical search phrase for the page is nowhere to be seen. I saw it as a throwaway page, just there until the new contents page gets in the serps.

Well, now it is number one again. I suppose that's based on the inbound links. But it is no longer penalized, I suspect because it no longer has those "good" phrases.

Lesson, strip your penalized web page to practically nothing and you will find it back at the top. Of course that isn't very useful to the visitors.

jk3210




msg:3249290
 4:36 pm on Feb 11, 2007 (gmt 0)

@annej

How long did it take once the page was re-spidered?

thegypsy




msg:3249291
 4:39 pm on Feb 11, 2007 (gmt 0)

So I think this assumption that google has failed to appreciate the relevancy of a page by putting it 950+ needs to be re addressed.

MHes – While I appreciate the analogy and insight, let’s NOT get into -3 ..950 ..whatever penalties. It is a silly discussion you will NEVER see me having, unless to tell folks NOT to. There are many TOTALLY CONFUSED threads on that silliness…. You are making HEAVY assumptions that do not fit with my calculations.

Didn’t you ever think – “with 24 billion pages it is numerically likely that 50 or so people would end up with common anomalies”? – I sure did. I have watched the penalty's name evolve over and over. From the outside, as a search engineering and mathematics enthusiast, it is quite amusing.

Much of what you're saying does not capture the essence of the technology. You are pigeon-holing instead of clearing your mind and viewing it cleanly. You are trying (already) to adapt it to your theories and beliefs. This, in my estimation, will not serve you well.

There are MANY ways of weighting (ranking) documents within the system and, as always, ‘playing with the dials’ will produce different effects. It is not as simple as you seem to be perceiving it. The ranking is more of an additive method. You are talking about this and that being ‘rejected’ – it is not quite as such. Along the sorting and weighting path, documents are given ADDITIONAL weighting as they satisfy various aspects of the algo. While minor, it is important to understand the paths and scoring methodology.

So, as long as no one minds, let’s stick to PaIR discussions…. NOT penalties and filters. It is far too early for such assumptions to be made. Let’s appreciate it for what it is, not what we believe it to be.

Even AnnaJ was making simple assumptions from recent experiences on ONE site in ONE market... it is not a healthy endeavor to make any assumptions at this point.

annej




msg:3249303
 4:59 pm on Feb 11, 2007 (gmt 0)

How long did it take once the page was re-spidered?

About three days. The same was true when a new inbound link brought a page back. It appears once you are spidered the change takes place.

AnnaJ was making simple assumptions from recent experiences

I meant to be giving examples of experiences that might fit this patent.

I'm not sure how we should proceed without examples or other possibilities. Should we reread the patent and see if we have any new insights? How do you view this thread going?

[edited by: annej at 5:07 pm (utc) on Feb. 11, 2007]

TheWhippinpost
msg:3249307
5:02 pm on Feb 11, 2007 (gmt 0)

Whippinpost – strangely it is NOT an addition to any LSA/I technologies. I dare say we may have missed the boat and it never left AdSense/AdWords. This is a standalone method that is far more comprehensive than LSI…. Jim seems to be getting the idea….

I'm basically agreeing with you (as far as I can tell). In my mind, it is an extension - or an evolution, if you like - to an LSI-type algo. Note particularly the word "type" there - I've always tried to be careful in discussions in not saying that G is employing LSI; we don't know.

But we do know it is something like LSI, at least. I don't think it's too difficult to extrapolate a scenario where, once you have all that data for single words, and once the technology becomes available, you can start crunching numbers that tie words together to form a better opinion of what is actually being discussed... just as we do in the real world.

Example: "Breaking neck" says one thing, but "breaking neck speed" tells me a bit more about the exact nature of the conversation. Now I would expect a narrower range of words like: cars, bikes, boats, whatever...
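The "breaking neck" point above can be sketched in a few lines: a longer phrase predicts a narrower, more specific set of expected co-occurring words. The term lists below are invented for illustration, not drawn from any real index.

```python
# Toy illustration (all data invented): the longer a phrase, the narrower
# the set of words we expect to see alongside it in an on-topic document.

EXPECTED_TERMS = {
    "breaking neck": {"injury", "hospital", "speed", "cars", "fall"},
    "breaking neck speed": {"cars", "bikes", "boats", "racing"},
}

def topical_fit(phrase, document_words):
    """Fraction of a phrase's expected terms actually found in the document."""
    expected = EXPECTED_TERMS[phrase]
    return len(expected & document_words) / len(expected)

doc = {"cars", "bikes", "racing", "track"}
print(topical_fit("breaking neck speed", doc))  # 3/4 = 0.75
print(topical_fit("breaking neck", doc))        # 1/5 = 0.2
```

A document about racing fits the three-word phrase far better than the two-word one – the extra word has disambiguated the topic.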

thegypsy
msg:3249311
5:13 pm on Feb 11, 2007 (gmt 0)

Ok.. I am with you there...

If they didn't use some of the LSI technologies, it seems LSI may have motivated them to work on a system such as this... so one could surmise the genesis is most likely LSI-based.

TheWhippinpost
msg:3249326
5:32 pm on Feb 11, 2007 (gmt 0)

Although I have a couple of "issues" with MHes's synopsis - mainly because I've not witnessed the "950" phenomenon, and I think the cluster of documents should be viewed more widely to encompass all web pages that fit the topic queried - I think he's close (at least to the model I have in my head anyway).

One potential problem keeps haunting me in all this though...

We could safely say, I think, that Einstein's paper on Relativity is the authoritative document on the subject. From that paper we're introduced, for the first time, to the phrase, "Theory of Relativity".

And from that phrase we would somewhere expect E=mc² and so on...

OK, so let's imagine that from the point of publication, a million documents are written by students, teachers, scientists, hobbyists, whoever...

And every one of those documents is wrong - they talk about (and refer to) each other's documents and conclusions. They introduce, along the way, other phrases which become part of the community's language... but they are wrong, and have skewed Einstein's work.

What, and how, could Google combat that false-positive?

... and then relate that to your own field?

(Incidentally, Einstein said words to the effect that fact is truth because we agree it is; this false-positive example underlines the danger of that, for Google (and us), quite nicely.)

[edited by: TheWhippinpost at 5:37 pm (utc) on Feb. 11, 2007]

MHes
msg:3249412
7:46 pm on Feb 11, 2007 (gmt 0)

TheWhippinpost - I suppose Google is not in the business of evaluating 'truth'. However, I seem to remember Brett talking of 'owning keywords'. In essence, I took this to mean that there is some weight given to the first recorded occurrence of a 'new' phrase. Thus Einstein's web page would be remembered as being an authority for that phrase and difficult to displace.
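The "owning a phrase" idea could be sketched as remembering which document first used a phrase and granting it a durable bonus. This is purely speculative illustration; the bonus value, function names, and data are all invented.

```python
# Hypothetical sketch of "owning keywords": the first document recorded
# using a phrase keeps a lasting authority bonus for it.
# first_seen, record, authority_bonus, and the 2.0 bonus are invented.

first_seen = {}  # phrase -> (doc_id, timestamp of earliest recorded occurrence)

def record(phrase, doc_id, timestamp):
    """Track the earliest recorded use of a phrase."""
    if phrase not in first_seen or timestamp < first_seen[phrase][1]:
        first_seen[phrase] = (doc_id, timestamp)

def authority_bonus(phrase, doc_id):
    """Extra weight only for the document that first introduced the phrase."""
    return 2.0 if first_seen.get(phrase, (None,))[0] == doc_id else 0.0

record("theory of relativity", "einstein1905", 1905)
record("theory of relativity", "student-essay", 1999)
print(authority_bonus("theory of relativity", "einstein1905"))  # 2.0
```

Under such a scheme the million wrong student pages could never dislodge the originating document from its bonus, which is one answer to the false-positive worry above.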

>While I appreciate the analogy and insight, let's NOT get into -3.. 950.. whatever penalties. It is a silly discussion you will NEVER see me having, unless to tell folks NOT to.

OK, so why did you say in your intro that you wanted... "suspicions of PaIR being at work in the -whatever penalties... more evidence that it is being implemented." Or am I reading you wrong? What do you want in this thread? I think a discussion without reference to 950 is a bit pointless.

cabbie
msg:3249478
9:44 pm on Feb 11, 2007 (gmt 0)

For all of google's patents to detect spam, their results have never been so full of it.

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
© Webmaster World 1996-2014 all rights reserved