| This 75 message thread spans 3 pages: < < 75 ( 1 2  ) || |
|"Phrase Based Indexing and Retrieval" - part of the Google picture?|
6 patents worth
I just noticed the thread on relationships of 'Phrase Based' layering in the -Whatever penalties.
This is interesting. That thread seems to be moving in a different direction so I started this for ONE simple area - Phrase Based Indexing and Retrieval (I call it PaIR to make life easier)
There is MORE here than the thoughts Ted started on regarding -30 type penalties. I have trudged through 5 of the PaIR related patents from the last year or so and written 3 articles and ONE conspiracy theory on the topic.
One of the more recent inferences was a conspiracy theory about the recent "GoogleBomb defused" affair.
Specifically, from the patent Phrase identification in an information retrieval system [appft1.uspto.gov]:
|" This approach has the benefit of entirely preventing certain types of manipulations of web pages (a class of documents) in order to skew the results of a search. Search engines that use a ranking algorithm that relies on the number of links that point to a given document in order to rank that document can be "bombed" by artificially creating a large number of pages with a given anchor text which then point to a desired page. As a result, when a search query using the anchor text is entered, the desired page is typically returned, even if in fact this page has little or nothing to do with the anchor text. Importing the related bit vector from a target document URL1 into the phrase A related phrase bit vector for document URL0 eliminates the reliance of the search system on just the relationship of phrase A in URL0 pointing to URL1 as an indicator of significance or URL1 to the anchor text phrase. |
 Each phrase in the index 150 is also given a phrase number, based on its frequency of occurrence in the corpus. The more common the phrase, the lower phrase number it receives in the index. The indexing system 110 then sorts 506 all of the posting lists in the index 150 in declining order according to the number of documents listed in each posting list, so that the most frequently occurring phrases are listed first. The phrase number can then be used to look up a particular phrase. "
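To make the phrase-numbering passage concrete, here is a minimal sketch of one way to read it: count how many documents each phrase appears in, then hand out phrase numbers so the most common phrase gets the lowest number. The function names, the sample URLs and the alphabetical tie-break are my own assumptions, not anything stated in the patent.

```python
from collections import defaultdict


def build_index(docs):
    """Map each phrase to its posting list (sorted ids of documents containing it)."""
    postings = defaultdict(set)
    for doc_id, phrases in docs.items():
        for phrase in phrases:
            postings[phrase].add(doc_id)
    return {phrase: sorted(ids) for phrase, ids in postings.items()}


def assign_phrase_numbers(index):
    """Give the most frequently occurring phrase the lowest number.

    Ties are broken alphabetically (an assumption made for determinism).
    """
    ordered = sorted(index, key=lambda p: (-len(index[p]), p))
    return {phrase: number for number, phrase in enumerate(ordered)}


# Hypothetical toy corpus: document id -> phrases found in it.
docs = {
    "url0": ["blue widgets", "cheap widgets"],
    "url1": ["blue widgets"],
    "url2": ["blue widgets", "cheap widgets", "widget repair"],
}

index = build_index(docs)
numbers = assign_phrase_numbers(index)
```

Here "blue widgets" appears in all three documents, so it ends up first in the ordering, and the number can then serve as a lookup key for the phrase.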
Call me a whacked out conspiracy theorist, but I think we could have something here. Is it outright evidence that Google has migrated to a PaIR based model? Of course not. I would surmise that it is simply another layer that has been laid over the existing system, and that the last major infrastructure update (dreaded BigDaddy) facilitated it. But that's just me.
I am curious about complementary/contrary theories as mentioned by Ted in the other "Phrase Based Optimization" thread. I simply wanted to keep a clean PaIR discussion.
For those looking to get a background in PaIR methods, links to all 5 patents:
Phrase-based searching in an information retrieval system [appft1.uspto.gov]
Multiple index based information retrieval system [appft1.uspto.gov]
Phrase-based generation of document descriptions [appft1.uspto.gov]
Phrase identification in an information retrieval system [appft1.uspto.gov]
Detecting spam documents in a phrase based information retrieval system [appft1.uspto.gov]
I would post snippets, but it is a TON of research (many groggy hours). I felt that posting WHAT "Phrase Based Indexing and Retrieval" is would also dilute the intended direction of the thread, which is to potentially stitch together this and the suspicions of PaIR being at work in the -whatever penalties... more evidence that it is being implemented.
Note: There is a sixth Phrase-based patent:
Phrase identification in an information retrieval system [appft1.uspto.gov]
[edited by: tedster at 6:59 am (utc) on May 14, 2007]
> You mention AW data... how about all of the combined GTB (G ToolBar) and Personalized Search data?
Of course the data gained by phrase-based IR is only one part of the story, but you wanted to go deeper, and that means discussing details of the patents, which always helps to refine our understanding. The patents you thankfully linked to concentrate on this phrase-based indexing stuff. Big Daddy's internal relationships to other research areas, like those you mentioned, are also described in parts of the patents.
When it comes to identifying co-occurring phrases, I also wonder about gmail data. Google does target ads there, and they even auto-save drafts while a new message is being composed, so we know they access the character strings.
|I believe, that the "good phrase list" is NOT computed in the running applications of this patent as described: It had been compiled somewhere else before. |
Yes, I read it that way too. For example from the spam patent, in the general section about the Search System from  through  mentions are made about the speed of results delivery:
| ...since these are already pre-ranked by their relevance to the phrase, the set of documents can be directly provided as the search results, providing essentially instantaneous results to the user. |
Hmmmm... that would throw PR out completely, no? With regard to speed. I also noted this section with interest:
| ...The described operations and their associated modules may be embodied in software, firmware or hardware. |
There's a lot more here that I want to work through. Intriguing items such as " c) Ranking Documents based on Date Range Relevance" where the patent discusses sometimes promoting newer documents, sometimes promoting older documents, and sometimes the most frequently (or recently?) updated ones.
The whole section about anchor phrases also seems very intriguing, pointing to how the 'thematic' importance of the entire linking page is calculated (something frequently noticed here on the forums.)
|If it's articles or product pages it doesn't really matter as U are competing with LIKE pages |
Gypsy, When you say LIKE pages do you mean pages that are alike in topic (extreme widgiting) or alike in style (sales catalog, academic article, scraper page)?
|The whole section about anchor phrases also seems very intriguing, pointing to how the 'thematic' importance of the entire linking page is calculated |
Tedster, Which patent and where is this information? I'd like to read through it.
I first noticed it in the spam detection patent (at  b) Ranking Documents Based on Anchor Phrases), but there are anchor phrase details in all 5 patents.
All of the time I am wondering how link structure, anchor text and PageRank work together nowadays. We all know dozens of threads where people report that the toolbar query results are completely inconsistent with the original PR formula.
I found some hints about local rank, but this does not really explain the story either. As thegypsy said, the patents insinuate that all these factors and others are combined "on the fly", perhaps even while crawling, but for performance reasons this is absolutely impossible with the original PageRank calculation.
What alternatives might google use, since PR obviously still plays an important role:
- does it simply use very old data, perhaps last calculated precisely in 2004 or so, together with some dirty data added later, trusting that those "newer" parts of the infrastructure will wipe out the dirt sufficiently?
- or are there means to PR-calculate subsets of the internet which are sufficiently independent from a structural perspective? I'd suspect that such large entities automatically develop some self-referentiality or self-similarity, which makes it superfluous to reiterate over the whole structure, so that iterations can instead be performed over the matrix of relatively independent subsets. And if this works on more than two levels, it might even be possible that this calculation is perfectly embedded in the new Big Daddy infrastructure in one big crawl process.
From my half-baked understanding of the butterfly-effect I would say: No, it is not that easy. But here I clearly scratch the limits of my mathematical knowledge of fractal theory and related fields.
I did not want to drive your thoughts too far OT, but as I said elsewhere, the PR calculation formula is always the most critical item from a mathematical and a performance point of view. BTW, did you notice this one in the second patent:
| Another problem with conventional information retrieval systems is that they can only index a relatively small portion of the documents available on the Internet. It is currently estimated that there are over 200 billion pages on the Internet today... |
Has anyone calculated what this would mean for the original PR formula? That is a hell of a lot, another eight bits beyond 32, on both paths of the two embedded loops.
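For anyone who wants to check the bit arithmetic, here is a small sketch. It only works out how many bits a distinct page id would need for the 200 billion pages quoted from the patent. Reading the "eight bits beyond 32" as rounding up to whole bytes is my assumption, not something stated in the post, and the function name is made up for illustration.

```python
def bits_needed(n_items):
    """Smallest number of bits that gives each of n_items a distinct id."""
    return max(1, (n_items - 1).bit_length())


pages = 200_000_000_000          # figure quoted from the patent
bits = bits_needed(pages)        # bits strictly required for a page id
extra_over_32 = bits - 32        # how far past a 32-bit id that goes

# Assumption: ids are stored in whole bytes, so round up to a multiple of 8.
byte_aligned_bits = ((bits + 7) // 8) * 8
```

Strictly, 38 bits are enough (2^37 is about 137 billion, 2^38 about 275 billion), which is 6 bits past 32. Rounded up to whole bytes it becomes 40 bits, which matches the "another eight bits" figure.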
So in order to get back to topic:
1) Does pagerank play any role at all still, considering what you, Tedster, quoted about relevance-evaluation of phrase-analysis alone?
2) If it does play a role: Where does it hook in during the relevance evaluation, and where and when in the infrastructure is it calculated?
3) Is (old) pagerank data nowadays perhaps only used to define crawl depth and crawl speed, i.e. important meta-coefficients for the analysis of large chunks of the internet (10 million pages each or one hard drive full, as said in the patents), thereby passing some initial variables to the calculation of local rank and further phrase-based analysis?
Sorry to not comment on all the questions that have arisen, but I've been lurking on this thread and had a few thoughts.
This entire system is designed against the "scraper" type of pages if I understand correctly. And by scraping, I don't necessarily mean doing an automated process, but data mining in general: collecting information and shoving it all into one page having all... meaning ALL semantically related words there to provide relevance, while having only one or two of these phrases in the anchor text pointing to the page.
Pages that are the results of lengthy and legit research may be very similar in their parameters.
In other words, if there are too many occurrences of only mildly related phrases - most of the time just to see if the page could rank for this and that too - the filter kicks in.
Sorry but I like to keep it simple, otherwise I couldn't handle all the information needed to do SEO.
As for the datasets, I have an idea you mentioned before as well.
But I'd like to flesh it out a bit more so that *I* could understand it.
Reverse engineer the "intent" behind this idea, and look at it in this way:
Such a site collects keyphrases related to "widgets", from let's say, the AdWords Keyword tool. Just does a query to see which keywords and phrases are the most sought out, and have the highest competition ( i.e. the highest paying ads ).
Then it does a query on them in Google and posts the titles, descriptions and links - as the "content" of its own homepage - along the blocks of ads, in hope of having created a context that would invite the highly relevant and highest paying advertisements related to "widgets".
Many well optimized blocks of text with an outbound link per paragraph ( anchor text and target also relevant and well optimized, it's on the top of the SERPs after all ).
But the blocks are only relevant to each other in terms that they all contain the word "widgets".
Then comes along the new system from Google.
First it goes to a different site, and looks at the thin products page of example.com ( a legit online store ) and counts the number of phrases it knows to be competitive, and contain the word "widgets" - which it recognizes to be the main theme of the page ( based on the navigation anchor, title, KWD, and so on ).
It finds 3 related competitive phrases on-page, for example "cheap widgets", "blue widgets" and "buy widgets", but compared to the overall picture of the set having 50 competitive - and only mildly related - terms, this is below the threshold, so the page gets an OK, and the filter leaves it alone.
Then the system examines the first site, the one at the scraper or otherwise massively researched page.
It sees that the main theme is widgets here as well.
It starts counting the occurrences of the phrases it predicts to be on the page based on this theme ( "predicts" as in, this system was designed to assume the pages are made for spam, thus it looks for the competitive terms ).
And then... whoops.
42 out of the 50 competitive terms are found, even though this page only has an inbound anchor text for about 1 or 2, or in case of a scraper site perhaps NONE.
Sometimes when a given word is broad in meaning, these phrases may only be semantically related. Otherwise a legit page may very well have all 50 featured but then it is likely to be the homepage, and with internal links and outbounds supporting the relevancy.
This page doesn't have these parameters in its support.
The result is, the page gets flagged as spam.
Whether it was a scraper site or a superbly researched page, it will now not rank where it used to be.
What's your opinion of this scenario?
And whether the most convenient dataset is actually at AdWords, because the very reason for spam is most likely AdSense relevancy.
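The scenario above can be sketched roughly as a two-part check: what share of the expected competitive phrases shows up on the page, and how many of those are backed by inbound anchor text. This is only a toy illustration of the scenario as described in this post. The thresholds (`max_share`, `min_support`), the function name and the sample phrases are all invented for the example and do not come from the patents.

```python
def looks_like_spam(page_text, competitive_phrases, inbound_anchors,
                    max_share=0.5, min_support=0.25):
    """Flag a page that carries most of the expected competitive phrases
    while few of them are supported by inbound anchor text.

    max_share and min_support are hypothetical tolerances.
    """
    if not competitive_phrases:
        return False
    text = page_text.lower()
    found = [p for p in competitive_phrases if p.lower() in text]
    share = len(found) / len(competitive_phrases)
    anchors = {a.lower() for a in inbound_anchors}
    backed = sum(1 for p in found if p.lower() in anchors)
    support = backed / len(found) if found else 1.0
    return share > max_share and support < min_support


# 50 made-up competitive phrases for the theme "widgets".
competitive = [f"widgets model {i:02d}" for i in range(50)]

store_page = " ".join(competitive[:3])      # thin product page: 3 of 50
scraper_page = " ".join(competitive[:42])   # researched/scraped page: 42 of 50
home_page = " ".join(competitive)           # legit homepage: all 50

store_flag = looks_like_spam(store_page, competitive, competitive[:1])
scraper_flag = looks_like_spam(scraper_page, competitive, competitive[:1])
home_flag = looks_like_spam(home_page, competitive, competitive)
```

With these made-up tolerances the store page and the well-supported homepage pass, and only the page with 42 phrases and a single supporting anchor gets flagged. Where the real tolerances sit, if anything like this is in use at all, is anyone's guess.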
Miamacs - Excellent post. We have been thinking the same way and found that when you start exploring how 'N' numbers and 'E' numbers would achieve this (as defined in the patent) the process is elegant, efficient and brilliantly simple. The one problem is the potential damage to innocent and comprehensive expert pages, but there are compensating factors that can pull a page back into the results. Basically, if it IS a really good page, other factors can kick in. We have been looking at Adwords data and concluded that it is a clue as to what data they have to work with, but comparing it with the content of pages that currently rank well could demonstrate that a lot of the suggested phrases are not a problem. Having said that, it all depends on the tolerances set and 'what phrase' actually triggers 'what phrase'. Gut instinct is often the decision maker and seems to work just as well!
When we speak of page rank, it can be a broad term that is easily confused. Some people think it's the little green toolbar, some think it's a behind-the-scenes little green toolbar. Now throw the terms trust rank and local rank in there and it gets really confusing. We are talking about small parts of the overall algo here where one can trump the other, and I have a perfect example of this.
We have one page on a very trusted site, never been any spam, unique content, text outweighs html, w3c compliant, has a great flash slideshow picture on it. The page itself has been ranking between 1-3 this entire year. Now, the page has no external links to it, just internals. The little green toolbar says 2 on it and it has stayed that way the past year as well.
This right there shows that trust, page layout, and good, unique content can easily trump external links. Is Google looking at the phrases of this page? Possibly when it was first crawled, but now Google is looking at the 75% bookmark rate of the page and using that in the overall ranking as well. We talk about bits and pieces of the algo, which is good for people to understand, but people also need to understand that it's just a bit and piece, and they need to work on the overall site and not just concentrate on one area.
Tedster - "The whole section about anchor phrases also seems very intriguing, pointing to how the 'thematic' importance of the entire linking page is calculated"
Would really appreciate some guidance on where that is in the patent. Sounds very important.
Open any patent in your browser and do a Find in Page for 'anchor' or 'anchor phrase' -- all 5 patents have related sections.
In IE menu: Edit > Find > This page
My mind gets a bit boggled with the patents but it does appear to me that emphasis is given to all anchor text including inbound, outbound and internal. I was tending to look at anchor text of links to the page.
I'm seeing why people have said that Google looks at outgoing links now.
In essence, for links it proposes:
1. Looking at the phrase relevance score of the page the link is on
2. Looking at the relevance of the link text
3. Looking at the phrase relevance score of the destination page
I would also assume the SITE 'profiles' of the destination and delivery pages could also come into play.
This could potentially explain the defusal of GoogleBombs....
People still give the example of 'click here' and Adobe - with the above method, we can safely assume pages with the 'Click Here' Href also contain a reference to 'Adobe' or 'Adobe Acrobat' - so the links would still hold some value...
Whereas 'failure' and related phrases are NOT on GWB/White House pages.....
Just a thought tho --- I haven't really followed it much.....
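The three link checks listed above, and the 'click here' versus 'failure' comparison, can be put into a rough sketch. The scoring formula here (a plain phrase-set overlap plus an anchor match, weighted half and half) is purely my own assumption for illustration, as are the function names and sample phrases. The patents describe related-phrase bit vectors, not this formula.

```python
def phrase_overlap(phrases_a, phrases_b):
    """Share of phrases the two pages have in common (0.0 to 1.0)."""
    a = {p.lower() for p in phrases_a}
    b = {p.lower() for p in phrases_b}
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def link_value(source_phrases, anchor_text, target_phrases):
    """Toy link score: how well the linking page and the anchor text
    both agree with the phrases on the destination page.

    The 50/50 weighting is a hypothetical choice.
    """
    target = {p.lower() for p in target_phrases}
    anchor_match = 1.0 if anchor_text.lower() in target else 0.0
    page_match = phrase_overlap(source_phrases, target_phrases)
    return 0.5 * page_match + 0.5 * anchor_match


# 'click here' link: the linking page also mentions the product.
adobe_score = link_value(
    ["click here", "adobe acrobat", "pdf"],
    "click here",
    ["adobe acrobat", "pdf", "download"],
)

# Bomb-style link: neither the linking page nor the anchor matches the target.
bomb_score = link_value(
    ["failure", "miserable failure"],
    "failure",
    ["president", "white house", "biography"],
)
```

In this toy version the 'click here' link keeps some value because the linking page shares phrases with the destination, while the bomb-style link scores zero on both counts.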
I agree, annej. These patents do point to anchor text as an on-page factor, so it may well be in play. One way or another it has always been a factor anyway, even before these specific patents.
The powerful on-page influence of anchor text is often not considered by webmasters - and it's relatively easy to overstuff a phrase on the page that way, especially when you first "get" that anchor text affects the target page as an off-page factor.
I want to remind people that in these patents a phrase can consist of one word. I was just reminded of this when I was working on some lost pages today. Back when I worked on them before I was thinking of a phrase as being more than one word. Now with a fresh look it appears that one word was doing the damage.