Forum Moderators: open

Message Too Old, No Replies

Lexical Similarity

Does PR-calculation now depend on content?

         

Oliver Henniges

3:24 pm on Oct 18, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



After the last TPR update a number of changes have been observed. One of the most significant was that pages with identical inbound-link structures sometimes came up with different PR values.

In order to analyse whether this might be explained by standard semantic measures, I have begun to write a little PHP script that determines the "semantic similarity" of two pages according to the cosine similarity function in word-vector space, as introduced by Salton & McGill (1983).
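For those who want to follow along, the measure itself is simple enough to sketch in a few lines (I'll use Python here rather than my PHP, the tokenizer is deliberately naive, and the stop-word filtering of the real script is omitted):

```python
import math
import re
from collections import Counter

def tokenize(text):
    # naive tokenizer: lowercase alphabetic words only;
    # stop-word removal and number handling deliberately omitted
    return re.findall(r"[a-z]+", text.lower())

def cosine_similarity(doc_a, doc_b):
    # term-frequency vectors in word-vector space (Salton & McGill)
    va, vb = Counter(tokenize(doc_a)), Counter(tokenize(doc_b))
    common = set(va) & set(vb)
    dot = sum(va[t] * vb[t] for t in common)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Identical texts score 1.0, texts with no shared words score 0.0, everything else falls in between.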

The results so far were satisfying in some respects but not in others, so I could use some help improving the script:

0) I need more examples of the significant changes described above: pages with identical inbound links (ideally only one!) but different PR.

1) How are numbers generally handled in information retrieval? For instance, my telephone number and postal code appear on almost every page, although elsewhere they might be sorted out as irrelevant.

2) Have I perhaps gone too far in stripping out all HTML tags? What about the img alt attributes? Lynx does show them. What else should be preserved? Does anyone have a good/short preg_replace pattern for extracting these alt attributes?
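For what it's worth, the pattern I have in mind would look something like this (a Python sketch; the same regular expression should drop straight into preg_match_all in PHP). Regex parsing of HTML is fragile, so treat it as a rough tool, not a parser:

```python
import re

def extract_alt_texts(html):
    # collect alt="..." values from img tags before the markup is stripped;
    # only double-quoted attributes are handled in this quick sketch
    return re.findall(r'<img\b[^>]*\balt\s*=\s*"([^"]*)"', html, re.IGNORECASE)
```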

3) What about e-mail addresses? Are they split into two words at the @ symbol, are they sorted out, or are they left as one long word?

4) Term frequency is another matter: to cope with files of different lengths, you take the number of occurrences of a word, divide it by the total number of words, and put that on a logarithmic scale. My source recommends the following formula:

tf(a) = ln(occ(a) + 1) / ln(numtot)

Does this really make sense in an algorithm comparing two files of almost identical length? These formulas were developed for investigating the relationship between a short query phrase and a long file to be searched.
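For reference, here is the formula as code, together with the point I'm uncertain about: for two files of the same length, ln(numtot) is the same constant for both vectors, and the cosine measure is invariant under scaling a vector by a constant, so this normalisation changes nothing in my comparison (Python sketch):

```python
import math

def tf(occ, numtot):
    # tf(a) = ln(occ(a) + 1) / ln(numtot), as in the formula above;
    # occ = occurrences of the term, numtot = total words in the file
    return math.log(occ + 1) / math.log(numtot)
```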

5) Similar arguments hold for "inverse document frequency." I have left this out so far because it would require a large database, for which I have no disk space (at least not online). I found a freely available database at Berkeley, but it's about half a gig in size, and I wonder about performance issues under PHP if I try to load all that in.
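For completeness, the usual idf weighting (one common textbook variant, not necessarily what any search engine uses) multiplies each tf component by ln(N/df), where N is the corpus size and df the number of documents containing the term. That is exactly why the large database is needed: df has to be counted over the whole corpus.

```python
import math

def idf(num_docs, doc_freq):
    # idf(t) = ln(N / df(t)): rare terms get a high weight,
    # terms present in every document get weight zero
    return math.log(num_docs / doc_freq)
```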

6) If lexical filtering IS an issue since Google's last update: on what basis would it most likely have been implemented? Many of you probably know far more about that than I do. Maybe someone has good arguments that all this is wasted time.

I FTPed the script, an introductory HTML form, the stopp-word.txt file and a txt version of the source code to my website, but I'm not comfortable posting URLs (especially my own) in this forum, so if you are interested in this topic, send me a mail; or perhaps someone else might want to moderate this thread.

doc_z

6:56 pm on Oct 19, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



After the last TPR update a number of changes have been observed. One of the most significant was that pages with identical inbound-link structures sometimes came up with different PR values.

In order to analyse whether this might be explained by standard semantic measures, I have begun to write a little PHP script that determines the "semantic similarity" of two pages according to the cosine similarity function in word-vector space, as introduced by Salton & McGill (1983).

I don't think that Google is mixing structure and content information. Also, it would be non-trivial to define such a PR model. (What does content-dependent mean: a content-dependent damping factor, or a non-equal distribution of PR?) Finally, the calculations for such a model would be very time-consuming.

I'm also seeing significant changes in the PR displayed in the toolbar. [webmasterworld.com] For example, I have a PR4 test page linking to 9 other pages - 7 are PR3, 1 is PR2 and 1 is PR1. I haven't found an explanation for this (although I have studied many possible modifications, e.g. a damping factor which depends on the age of the link). Even a content dependent model wouldn't be a solution (at least for the cases I'm studying).

I guess I have to wait at least one more PR update to see what's going on.

Oliver Henniges

7:01 am on Oct 20, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You know I contributed to the thread you just quoted, and as I wrote in my last post:

> they also could add additional coefficients to the values before the iteration process.

This is what I'd currently suggest as most likely:
Backlinks are analysed for their lexical relevance, and below a certain value a PR0, if not a penalty, is inherited. As a side effect this would massively reduce the number of calculations within the iteration process and thus help with the performance issues discussed elsewhere (if there are any).
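To illustrate the idea (a toy Python sketch; the 0.25 threshold and the relevance scores are made-up numbers, not anything Google is known to use): links whose lexical-relevance score falls below the threshold are dropped before the PR power iteration starts, so the iteration runs on a sparser link graph.

```python
def filter_links(links, relevance, threshold=0.25):
    # links: {target_page: [source_pages]}; relevance: {(source, target): score}
    # drop any link whose lexical-relevance score is below the threshold
    return {t: [s for s in srcs if relevance.get((s, t), 0.0) >= threshold]
            for t, srcs in links.items()}

def pagerank(links, num_pages, damping=0.85, iterations=50):
    # plain power iteration on the (possibly pre-filtered) link graph;
    # dangling pages are not handled in this toy version
    pr = [1.0 / num_pages] * num_pages
    out_degree = [0] * num_pages
    for sources in links.values():
        for s in sources:
            out_degree[s] += 1
    for _ in range(iterations):
        new = [(1.0 - damping) / num_pages] * num_pages
        for target, sources in links.items():
            for s in sources:
                new[target] += damping * pr[s] / out_degree[s]
        pr = new
    return pr
```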

MHes

7:20 am on Oct 20, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>pages with identical inbound-link-structure came up with different PR-values

How do you know what the real PR is and how many links Google is counting? I would never trust the toolbar PR, and the 'links in' tool is a random snapshot (probably way out of date). Assuming pages have an identical link structure is a big assumption. Within days a page may have been listed by some small directory, and knowing this may be damn near impossible.

doc_z

1:23 pm on Oct 20, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Backlinks are analysed for their lexical relevance, and below a certain value a PR0, if not a penalty, is inherited. As a side effect this would massively reduce the number of calculations within the iteration process and thus help with the performance issues discussed elsewhere (if there are any).

This might reduce the (CPU) time for one iteration but it won't reduce the number of iterations. Most likely this would slightly increase the number of iterations.

How do you know what the real PR is and how many links Google is counting? I would never trust the toolbar PR, and the 'links in' tool is a random snapshot (probably way out of date). Assuming pages have an identical link structure is a big assumption.

If you have built several similar (artificial) linking structures with thousands of test pages just for the purpose of PR measurement, you can be almost sure (especially when you have been performing these tests for a very long time and can compare them with old data).

Oliver Henniges

2:03 pm on Oct 20, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> How do you know what the real PR is and how many links Google is counting?

It also works the other way round on a small scale: I have four pages, FTPed in summer, which saw their first visitors by the end of September and got PR by Oct 6th. All are linked to only by the main index page and the sitemap, and a search for their URLs in quotes on Google and other search engines shows no other backlinks. They came up with different PR values, and I want to know why.

I believe that the value Google shows in the toolbar is indeed the value Google uses for its calculations, though with that specific time lag between end of September and October. But of course I concede it's a belief.

The latter part of your question, how many of the backlinks are counted, is exactly what we are trying to find out at the moment, because it seems that for the first time not all backlinks are valued equally.