"Phrase Based Indexing and Retrieval" - part of the Google picture?


Oliver_Henniges - 10:28 am on Feb 21, 2007 (gmt 0)


I keep wondering how link structure, anchor text and PageRank work together nowadays. We all know dozens of threads where people report that the toolbarqueries results are completely inconsistent with the original PR formula.

I found some hints about local rank, but that does not really explain the story either. As thegyspsy said, the patents insinuate that all these factors and others are combined "on the fly", perhaps even while crawling, but for performance reasons this is practically impossible with the original PageRank calculation, which has to iterate repeatedly over the whole link graph.
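For reference, the original calculation can be sketched in a few lines of Python. This is only a minimal sketch of the published formula PR(A) = (1-d) + d * sum(PR(T)/C(T)); the function name `pagerank`, the toy graph and the fixed iteration count are illustrative assumptions, not anything taken from the patents. The point is that every pass touches every link, which is why recomputing it on the fly looks so expensive.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate the original PageRank formula over a link graph.

    links maps each page to the list of pages it links to.
    d is the damping factor; 0.85 is the value usually quoted.
    """
    # Collect every page that appears as a source or as a link target.
    pages = set(links) | {t for targets in links.values() for t in targets}
    pr = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Every page starts each pass with the (1 - d) base value.
        new_pr = {p: 1.0 - d for p in pages}
        for source, targets in links.items():
            if not targets:
                continue  # dangling page: passes nothing on in this simple sketch
            share = d * pr[source] / len(targets)
            for target in targets:
                new_pr[target] += share
        pr = new_pr
    return pr


# Hypothetical three-page web, just to exercise the function.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(toy_web)
```

Each of the `iterations` passes walks the full link list once, so the cost grows with the number of links times the number of passes.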

Since PR obviously still plays an important role, what alternatives might Google use?

- Does it simply use very old data, perhaps last calculated precisely in 2004 or so, together with some dirty data added later, trusting that the "newer" parts of the infrastructure will wipe out the dirt sufficiently?

- Or are there ways to calculate PR for subsets of the internet that are sufficiently independent from a structural perspective? I'd suspect that such large entities automatically develop some self-referentiality or self-similarity, which would make it superfluous to iterate over the whole structure again; instead one would iterate over the matrix of relatively independent subsets. And if this works on more than two levels, the calculation might even be embedded in the new Big-Daddy infrastructure as part of one big crawl process.
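The second alternative can be sketched as a two-level approximation: rank the pages inside each subset using only its internal links, rank the subsets against each other on a collapsed graph, then multiply the two. This is purely a thought experiment under stated assumptions; nothing in the patents confirms that Google works this way, and `two_level_rank`, `block_of` and the toy data are hypothetical names made up for the illustration.

```python
def pagerank(links, d=0.85, iterations=50):
    """Plain iteration of PR(A) = (1-d) + d * sum(PR(T)/C(T))."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    pr = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new_pr = {p: 1.0 - d for p in pages}
        for source, targets in links.items():
            if targets:
                share = d * pr[source] / len(targets)
                for target in targets:
                    new_pr[target] += share
        pr = new_pr
    return pr


def two_level_rank(links, block_of):
    """Approximate page scores from per-block ranks and a block-level rank.

    block_of maps every page to its subset (for example its host).
    """
    # One small link graph per block, holding only the internal links.
    local_links = {block: {} for block in set(block_of.values())}
    for page, block in block_of.items():
        local_links[block][page] = []
    # Collapsed graph: one node per block, one edge per cross-block link.
    block_links = {block: [] for block in local_links}
    for source, targets in links.items():
        for target in targets:
            sb, tb = block_of[source], block_of[target]
            if sb == tb:
                local_links[sb][source].append(target)
            else:
                block_links[sb].append(tb)
    block_rank = pagerank(block_links)
    approx = {}
    for block, graph in local_links.items():
        local = pagerank(graph)  # touches only this block's pages
        total = sum(local.values())
        for page, score in local.items():
            # Page's share of its block, scaled by the block's own rank.
            approx[page] = block_rank[block] * score / total
    return approx


# Hypothetical toy web: two hosts, A and B, with a few cross links.
links = {
    "a1": ["a2"], "a2": ["a1", "b1"], "a3": ["a1"],
    "b1": ["b2"], "b2": ["b1", "a1"],
}
block_of = {"a1": "A", "a2": "A", "a3": "A", "b1": "B", "b2": "B"}
approx = two_level_rank(links, block_of)
```

The attraction is that each block can be recomputed on its own, for instance when one chunk of the crawl is refreshed. The catch is exactly the one raised in the question: the result is only an approximation, and how good it is depends on how independent the subsets really are.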

From my half-baked understanding of the butterfly effect I would say: no, it is not that easy. But here I am clearly at the limits of my mathematical knowledge of fractal theory and related fields.

I don't want to drag the thread too far off topic, but as I said elsewhere, the PR calculation formula is always the most critical item from both a mathematical and a performance point of view. BTW, did you notice this one in the second patent:

[0008] Another problem with conventional information retrieval systems is that they can only index a relatively small portion of the documents available on the Internet. It is currently estimated that there are over 200 billion pages on the Internet today...

Has anyone calculated what this would mean for the original PR formula? That is a hell of a lot: another eight bits beyond 32 for every page identifier, in both of the two nested loops.
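A quick back-of-the-envelope check of that size argument, as a sketch only. The 200 billion figure is the one quoted from paragraph [0008]; the variable names are illustrative. By this count a page ID needs 38 bits, six more than 32, which becomes the eight extra bits mentioned above once IDs are padded to whole bytes (40 bits).

```python
pages = 200_000_000_000  # estimate quoted from paragraph [0008] of the patent

# Smallest number of bits that can give every page a distinct ID.
bits_for_page_id = (pages - 1).bit_length()

# How far that goes beyond a 32-bit identifier.
extra_bits = bits_for_page_id - 32

# Same width rounded up to whole bytes, as storage formats usually require.
padded_bits = -(-bits_for_page_id // 8) * 8

# How many times larger the page count is than the 32-bit address space.
ratio_to_32_bit = pages / 2**32

print(bits_for_page_id, extra_bits, padded_bits, round(ratio_to_32_bit, 1))
```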

So in order to get back to topic:

1) Does PageRank still play any role at all, considering what you, Tedster, quoted about relevance evaluation by phrase analysis alone?

2) If it does play a role: where does it hook into the relevance evaluation, and where and when in the infrastructure is it calculated?

3) Is (old) PageRank data nowadays perhaps only used to define crawl depth and crawl speed, i.e. as important meta-coefficients for analysing large chunks of the internet (10 million pages each, or one hard drive full, as the patents say), thereby passing some initial variables to the calculation of local rank and the further phrase-based analysis?


Thread source: http://www.webmasterworld.com/google/3247207.htm
Brought to you by WebmasterWorld: http://www.webmasterworld.com