Forum Moderators: open

Message Too Old, No Replies

Lexical Similarity

Does PR-calculation now depend on content?

         

Oliver Henniges

3:24 pm on Oct 18, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



After the last TPR update a number of changes have been observed. One of the most significant was that pages with identical inbound-link structures sometimes came up with different PR values.

In order to analyse whether this might be explained by standard semantic measures, I have begun to write a little PHP script that determines the "semantic similarity" of two pages according to the cosine similarity function in word-vector space, as introduced by Salton & McGill (1983).
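For those who want to follow along, the measure itself is simple enough to sketch in a few lines (I'll use Python here rather than my PHP, the tokenizer is deliberately naive, and the stop-word filtering of the real script is omitted):

```python
import math
import re
from collections import Counter

def tokenize(text):
    # naive tokenizer: lowercase alphabetic words only;
    # stop-word removal and number handling deliberately omitted
    return re.findall(r"[a-z]+", text.lower())

def cosine_similarity(doc_a, doc_b):
    # term-frequency vectors in word-vector space (Salton & McGill)
    va, vb = Counter(tokenize(doc_a)), Counter(tokenize(doc_b))
    common = set(va) & set(vb)
    dot = sum(va[t] * vb[t] for t in common)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Identical texts score 1.0, texts with no shared words score 0.0, everything else falls in between.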

The results so far were satisfying in some respects but not in others, so I could use some help improving the script:

0) I need more examples of the significant changes described above: pages with identical inbound links (ideally only one!) but different PR.

1) How are numbers generally handled in information retrieval? For instance, my telephone number and postal code appear on almost every page, although elsewhere they might be sorted out as irrelevant.

2) Have I perhaps gone too far in stripping out all HTML tags? What about the img alt attributes? Lynx does show them. What else should be preserved? Does anyone have a good/short preg_replace pattern for extracting these alt attributes?
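For what it's worth, the pattern I have in mind would look something like this (a Python sketch; the same regular expression should drop straight into preg_match_all in PHP). Regex parsing of HTML is fragile, so treat it as a rough tool, not a parser:

```python
import re

def extract_alt_texts(html):
    # collect alt="..." values from img tags before the markup is stripped;
    # only double-quoted attributes are handled in this quick sketch
    return re.findall(r'<img\b[^>]*\balt\s*=\s*"([^"]*)"', html, re.IGNORECASE)
```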

3) What about e-mail addresses? Are they split into two words at the @ symbol, are they sorted out, or are they left as one long word?

4) Term frequency is another matter: to cope with files of different lengths, you take the number of occurrences of a word, divide it by the total number of words, and put that on a logarithmic scale. My source recommends the following formula:

tf(a) = ln(occ(a) + 1) / ln(numtot)

Does this really make sense in an algorithm comparing two files of almost identical length? These formulas were developed for investigating the relationship between a short query phrase and a long file to be searched.
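For reference, here is the formula as code, together with the point I'm uncertain about: for two files of the same length, ln(numtot) is the same constant for both vectors, and the cosine measure is invariant under scaling a vector by a constant, so this normalisation changes nothing in my comparison (Python sketch):

```python
import math

def tf(occ, numtot):
    # tf(a) = ln(occ(a) + 1) / ln(numtot), as in the formula above;
    # occ = occurrences of the term, numtot = total words in the file
    return math.log(occ + 1) / math.log(numtot)
```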

5) Similar arguments hold for "inverse document frequency." I have left this out so far because it would require a large database, for which I have no disk space (at least not online). I found a freely available database at Berkeley, but it's about half a gig in size, and I wonder about performance issues under PHP if I try to load all that in.
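For completeness, the usual idf weighting (one common textbook variant, not necessarily what any search engine uses) multiplies each tf component by ln(N/df), where N is the corpus size and df the number of documents containing the term. That is exactly why the large database is needed: df has to be counted over the whole corpus.

```python
import math

def idf(num_docs, doc_freq):
    # idf(t) = ln(N / df(t)): rare terms get a high weight,
    # terms present in every document get weight zero
    return math.log(num_docs / doc_freq)
```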

6) If lexical filtering IS an issue since Google's last update: on what basis would it most likely have been implemented? Many of you probably know far more about that than I do. Maybe someone has good arguments that all this is wasted time.

I FTPed the script, an introductory HTML form, the stopp-word.txt file and a txt version of the source code to my website, but I'm not comfortable posting URLs (especially my own) in this forum, so if you are interested in this topic, send me a mail; or perhaps someone else might want to moderate this thread.

doc_z

6:56 pm on Oct 19, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



After the last TPR update a number of changes have been observed. One of the most significant was that pages with identical inbound-link structures sometimes came up with different PR values.

In order to analyse whether this might be explained by standard semantic measures, I have begun to write a little PHP script that determines the "semantic similarity" of two pages according to the cosine similarity function in word-vector space, as introduced by Salton & McGill (1983).

I don't think that Google is mixing structure and content information. Also, it would be non-trivial to define such a PR model. (What does content-dependent mean: a content-dependent damping factor, or a non-equal distribution of PR?) Finally, the calculations for such a model would be very time-consuming.

I'm also seeing significant changes in the PR displayed in the toolbar. [webmasterworld.com] For example, I have a PR4 test page linking to 9 other pages - 7 are PR3, 1 is PR2 and 1 is PR1. I haven't found an explanation for this (although I have studied many possible modifications, e.g. a damping factor which depends on the age of the link). Even a content dependent model wouldn't be a solution (at least for the cases I'm studying).

I guess I have to wait at least one more PR update to see what's going on.

Oliver Henniges

7:01 am on Oct 20, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You know I contributed to the thread you just quoted, and as I wrote in my last post:

> they also could add additional coefficients to the values before the iteration process.

This is what I'd currently suggest as most likely:
Backlinks are analysed for their lexical relevance, and below a certain value a PR0, if not a penalty, is inherited. As a side effect this would massively reduce the number of calculations within the iteration process and thus help with the performance issues discussed elsewhere (if there are any).
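To illustrate the idea (a toy Python sketch; the 0.25 threshold and the relevance scores are made-up numbers, not anything Google is known to use): links whose lexical-relevance score falls below the threshold are dropped before the PR power iteration starts, so the iteration runs on a sparser link graph.

```python
def filter_links(links, relevance, threshold=0.25):
    # links: {target_page: [source_pages]}; relevance: {(source, target): score}
    # drop any link whose lexical-relevance score is below the threshold
    return {t: [s for s in srcs if relevance.get((s, t), 0.0) >= threshold]
            for t, srcs in links.items()}

def pagerank(links, num_pages, damping=0.85, iterations=50):
    # plain power iteration on the (possibly pre-filtered) link graph;
    # dangling pages are not handled in this toy version
    pr = [1.0 / num_pages] * num_pages
    out_degree = [0] * num_pages
    for sources in links.values():
        for s in sources:
            out_degree[s] += 1
    for _ in range(iterations):
        new = [(1.0 - damping) / num_pages] * num_pages
        for target, sources in links.items():
            for s in sources:
                new[target] += damping * pr[s] / out_degree[s]
        pr = new
    return pr
```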

MHes

7:20 am on Oct 20, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>pages with identical inbound-link-structure came up with different PR-values

How do you know what the real PR is and how many links Google is counting? I would never trust the toolbar PR, and the 'links in' tool is a random snapshot (probably way out of date). Assuming pages have an identical link structure is a big assumption. Within days a page may have been listed by some small directory, and knowing this may be damn near impossible.

doc_z

1:23 pm on Oct 20, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Backlinks are analysed for their lexical relevance, and below a certain value a PR0, if not a penalty, is inherited. As a side effect this would massively reduce the number of calculations within the iteration process and thus help with the performance issues discussed elsewhere (if there are any).

This might reduce the (CPU) time for one iteration but it won't reduce the number of iterations. Most likely this would slightly increase the number of iterations.

How do you know what the real PR is and how many links Google is counting? I would never trust the toolbar PR, and the 'links in' tool is a random snapshot (probably way out of date). Assuming pages have an identical link structure is a big assumption.

If you have built several similar (artificial) linking structures with thousands of test pages just for the purpose of PR measurement, you can be almost sure (especially when you have been performing these tests for a very long time and can compare them with old data).

Oliver Henniges

2:03 pm on Oct 20, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> How do you know what the real PR is and how many links Google is counting?

It also works the other way round on a small scale: I have four pages, FTPed in summer, which saw their first visitors by the end of September and got PR by Oct 6th. All are linked to only by the main index page and the sitemap, and a search for their URLs in quotes on Google and other search engines shows no other backlinks. They came up with different PR values, and I want to know why.

I believe that the value Google shows in the toolbar is indeed the value Google uses for its calculations, though with that specific time lag between end of September and October. But of course I concede it's a belief.

The latter part of your question, how many of the backlinks are counted, is exactly what we are trying to find out at the moment, because it seems that for the first time not all backlinks are valued equally.