Explanation of the method on en.wikipedia.org
My approach is as follows:
1) Extract all text from the document and split it into single words, which are then stemmed => array containing the word stems [split()]
2) Delete stop words ("and", "but", ...) [foreach loop: check whether the word is in the stop words array]
3) Calculate weights for the terms:
foreach ($woerter as $wort) {
    $freq = number_of_$wort_in_this_document;
    $max = number_of_most_frequent_word_in_this_document;
    $doc1 = number_of_documents_in_database;
    $doc2 = number_of_documents_in_database_containing_$wort;
    $weight = $freq/$max * log($doc1/$doc2);
}
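Steps 1 and 2 could be sketched roughly like this. It is only a minimal sketch under a few assumptions: the stop word list and the function name extractTerms() are made up for illustration, preg_split() is used because split() was removed in PHP 7, and stemming is left out because PHP ships no built-in stemmer.

```php
<?php
// Hypothetical stop word list; a real one would be much longer.
$stopWords = array('and', 'but', 'the', 'a', 'of');

/**
 * Split raw text into lowercase words and drop stop words.
 * Stemming is NOT done here (assumption: a separate stemmer
 * would be applied to each word before it is stored).
 */
function extractTerms($text, array $stopWords)
{
    // Split on every run of characters that is not a letter or digit.
    // strtolower() only lowercases ASCII letters.
    $words = preg_split('/[^\p{L}\p{N}]+/u', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

    // Flip the list once so each lookup is a key check instead of in_array().
    $stop = array_flip($stopWords);

    $terms = array();
    foreach ($words as $word) {
        if (!isset($stop[$word])) {
            $terms[] = $word;
        }
    }
    return $terms;
}

print_r(extractTerms('The cat and the dog, but not the bird.', $stopWords));
```

With the sample sentence this leaves the words cat, dog, not and bird.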
The loop should just act as a counting and matching mechanism, so you don't need to set constants inside it; otherwise they will be set on every iteration or, even worse, reset.
The loop also does not contain a conditional to check for any match.
I think you might be best off sketching this out on paper first to get the general principle in order.
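To make that a bit more concrete, here is a rough sketch of the weighting for a single document, with the per-document values computed once outside the loop and a conditional inside it. The function name termWeights() and the inputs $docFreq (term => number of documents containing it) and $numDocs are assumptions for the example; in practice they would come from your database.

```php
<?php
/**
 * Weights for the terms of one document:
 * weight = (freq / max) * log(numDocs / docFreq).
 * $docFreq and $numDocs are assumed to be looked up beforehand.
 */
function termWeights(array $words, array $docFreq, $numDocs)
{
    // Count every word once, before the weighting loop starts.
    $counts = array_count_values($words);
    if (empty($counts)) {
        return array();
    }

    // The same for every term of this document, so computed only once.
    $max = max($counts);

    $weights = array();
    foreach ($counts as $word => $freq) {
        // Conditional: skip terms with no known document frequency,
        // which also avoids a division by zero.
        if (empty($docFreq[$word])) {
            continue;
        }
        $weights[$word] = ($freq / $max) * log($numDocs / $docFreq[$word]);
    }
    return $weights;
}

$weights = termWeights(
    array('cat', 'dog', 'cat', 'bird'),
    array('cat' => 2, 'dog' => 5, 'bird' => 1),
    10
);
print_r($weights);
```

Here "cat" is the most frequent word, so its weight is simply log(10/2), while "dog" and "bird" are scaled by 1/2.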
After this code ...
[PHP]
$term_document_matrix = array();
foreach ($words as $word) {
    // $freq is the number of occurrences of $word in this document;
    // $max is the number of occurrences of the most frequent word in this document;
    // $doc1 is the number of documents in database;
    // $doc2 is the number of documents in database containing $word;
    $weight = $freq/$max * log($doc1/$doc2);
    $term_document_matrix[$word] = $weight;
}
[/PHP]
... I should have an array with all words as the keys and the weights as the values.
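One thing to keep in mind: an array of word => weight covers only a single document, i.e. one column of the term-document matrix. A possible sketch for the whole matrix, assuming all documents are available as lists of already stemmed words (the function name termDocumentMatrix() and the sample data are invented for the example):

```php
<?php
/**
 * Build a term-document matrix: $matrix[$term][$docIndex] = weight.
 * $documents is assumed to be a list of word lists, already stemmed
 * and with stop words removed.
 */
function termDocumentMatrix(array $documents)
{
    $numDocs  = count($documents);
    $counts   = array(); // per document: term => occurrences
    $maxCount = array(); // per document: occurrences of its most frequent term
    $docFreq  = array(); // term => number of documents containing it

    foreach ($documents as $d => $words) {
        $counts[$d]   = array_count_values($words);
        $maxCount[$d] = $counts[$d] ? max($counts[$d]) : 1;
        foreach ($counts[$d] as $term => $n) {
            $docFreq[$term] = isset($docFreq[$term]) ? $docFreq[$term] + 1 : 1;
        }
    }

    $matrix = array();
    foreach ($docFreq as $term => $df) {
        // The log factor depends only on the term, not on the document.
        $idf = log($numDocs / $df);
        foreach ($documents as $d => $words) {
            $freq = isset($counts[$d][$term]) ? $counts[$d][$term] : 0;
            $matrix[$term][$d] = ($freq / $maxCount[$d]) * $idf;
        }
    }
    return $matrix;
}

$matrix = termDocumentMatrix(array(
    array('cat', 'dog', 'cat'),
    array('dog', 'bird'),
    array('cat', 'fish'),
));
print_r($matrix);
```

Every row is one term and every column one document. Note that a term occurring in all documents gets the weight 0 everywhere, because log(1) = 0.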
But there's still a lot of work to do then, isn't there?
Wikipedia says that I should use singular-value decomposition to split the term document matrix into three components:
( A = U * S * Vt )
The orthogonal matrices U and V contain the eigenvectors of AAt and AtA respectively, and S is a diagonal matrix holding the square roots of the eigenvalues of AtA, also called singular values.
Using the singular values in the matrix S, you can control the dimensionality reduction by successively leaving out the smallest singular value until only a freely chosen number k of them is left.
But my problem is: I don't know how to code what is said in the article of Wikipedia. Could you please help me to find an approach for coding that?
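One possible starting point, purely as a sketch: PHP has no built-in SVD, so the code below uses power iteration with deflation to approximate only the k largest singular values and their vectors. That is a simple substitute technique, not what the Wikipedia article prescribes, and it is slow and numerically naive compared with a real linear algebra library. All function names are made up, and the matrix is assumed to be a list of rows with numeric indices starting at 0.

```php
<?php
// Multiply matrix $a (list of rows) by vector $v.
function matVec(array $a, array $v)
{
    $out = array();
    foreach ($a as $i => $row) {
        $sum = 0.0;
        foreach ($row as $j => $x) {
            $sum += $x * $v[$j];
        }
        $out[$i] = $sum;
    }
    return $out;
}

// Multiply the transpose of $a by vector $u.
function matTVec(array $a, array $u)
{
    $out = array_fill(0, count($a[0]), 0.0);
    foreach ($a as $i => $row) {
        foreach ($row as $j => $x) {
            $out[$j] += $x * $u[$i];
        }
    }
    return $out;
}

// Euclidean length of a vector.
function vecNorm(array $v)
{
    $sum = 0.0;
    foreach ($v as $x) {
        $sum += $x * $x;
    }
    return sqrt($sum);
}

/**
 * Approximate the $k largest singular triplets of $a.
 * Returns a list of array('sigma' => float, 'u' => array, 'v' => array).
 */
function truncatedSvd(array $a, $k, $iterations = 200)
{
    $result = array();
    $n = count($a[0]);
    for ($c = 0; $c < $k; $c++) {
        // Deterministic, non-uniform start vector.
        $v = array();
        for ($j = 0; $j < $n; $j++) {
            $v[$j] = 1.0 / ($j + 1);
        }
        // Power iteration on AtA: v converges to the dominant right singular vector.
        for ($it = 0; $it < $iterations; $it++) {
            $w = matTVec($a, matVec($a, $v));
            $len = vecNorm($w);
            if ($len < 1e-12) {
                break;
            }
            foreach ($w as $j => $x) {
                $v[$j] = $x / $len;
            }
        }
        $av = matVec($a, $v);
        $sigma = vecNorm($av);
        if ($sigma < 1e-12) {
            break; // nothing left to extract
        }
        $u = array();
        foreach ($av as $i => $x) {
            $u[$i] = $x / $sigma;
        }
        $result[] = array('sigma' => $sigma, 'u' => $u, 'v' => $v);

        // Deflation: subtract the component just found, sigma * u * vt.
        foreach ($a as $i => $row) {
            foreach ($row as $j => $x) {
                $a[$i][$j] = $x - $sigma * $u[$i] * $v[$j];
            }
        }
    }
    return $result;
}

// Tiny check with a matrix whose singular values are known to be 3 and 2.
$svd = truncatedSvd(array(array(3.0, 0.0), array(0.0, 2.0)), 2);
foreach ($svd as $component) {
    echo $component['sigma'], "\n"; // values close to 3 and 2
}
```

Keeping only the first k triplets is the "leaving out the smallest values" step. For a real term-document matrix with thousands of terms, this plain PHP version will likely be far too slow, so treat it as a way to understand the mechanics rather than something to run in production.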
coopster said: if you are having trouble converting the conversational/instructional english text to code
DNnetz said: The problem for me is not how to code but how to convert the text into program code.
I'm not questioning your ability to code, not at all. Nor am I questioning your intelligence. Quite the contrary, especially when I noticed the discussion topic. You obviously have both initiative and intelligence.
Let me explain myself a bit more here using personal experience as an analogy. I'm not a mathematician but consider myself competent enough to understand complex expressions. Years ago I decided to write a distance calculator. I did my research and discovered many different formulas were available for calculating distance on a spherical surface. Then I came across a formula and discussion amongst some great minds initiated by a fella that worked for NASA. I found a winner. Next up, converting the formula to code and integrating it with a latitude/longitude database table. It took me quite a few hours to get it properly coded and returning results correctly, including indexing my tables for performance.
The "few hours to get it properly coded" part is where you are right now. My point in this discussion is that most everybody in the forum works a regular job and volunteers their efforts here when they are able. To evaluate and begin breaking your LSA text down is going to take time. I'm guessing the lack of response to your discussion here has one of two reasons -- either readers lack the ability to undertake such a task or they just don't have the time to freely donate their efforts. And knowing the members of this community, I'm banking on the latter reason.
I hope this makes sense, DNnetz.
LSA is hard to scale, and PHP isn't exactly the best language for high-performance analysis of this kind: don't expect to spend a few hours on it and get results. I also don't think this is the right forum for this sort of question.
If you do this sort of stuff you are essentially on your own and if you want to succeed you will need to get used to it.
Good luck.