Latent semantic analysis in PHP

Forum Moderators: coopster


DNnetz

7:43 pm on Mar 4, 2009 (gmt 0)

I want to code an application which uses "Latent semantic analysis" to extract the most important keywords out of a text.

Explanation of the method on en.wikipedia.org

My approach is as follows:
1) Extract all text from the document and split it into single words, which are then stemmed => array containing the word stems [split()]
2) Delete stop words ("and", "but", ...) [foreach-loop: check if in stop words array]
3) Calculate weights for the terms:


foreach ($woerter as $wort) {
 $freq = number_of_$wort_in_this_document;
 $max = number_of_most_frequent_word_in_this_document;
 $doc1 = number_of_documents_in_database;
 $doc2 = number_of_documents_in_database_containing_$wort;
 $weight = $freq/$max * log($doc1/$doc2);
}

I know, a lot of code is still missing. But I don't know how to code the rest. Could you please help me? That would be great. Thanks in advance!
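
Steps 1 and 2 above could be sketched roughly as follows. This is only a minimal sketch under a few assumptions: the stop word list is a made-up placeholder, the function name extract_terms() is hypothetical, and stemming is left out because core PHP has no built-in stemmer as far as I know. preg_split() is used in place of split(), which was deprecated in PHP 5.3 and removed in PHP 7.

```php
<?php
// Hypothetical stop word list; a real one would be much longer.
$stop_words = array('and', 'but', 'the', 'a', 'of', 'is');

/**
 * Lowercase the text, split it on anything that is not a letter or digit,
 * and drop stop words. Stemming is deliberately left out of this sketch.
 */
function extract_terms($text, array $stop_words)
{
    // strtolower() only handles ASCII reliably; that is enough for this example.
    $tokens = preg_split('/[^\p{L}\p{N}]+/u', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    // array_diff() keeps the original keys, so re-index with array_values().
    return array_values(array_diff($tokens, $stop_words));
}

$words = extract_terms('The cat and the dog chase the cat.', $stop_words);
// $words is now array('cat', 'dog', 'chase', 'cat')
```

Repeated words are kept on purpose, because the weighting in step 3 needs to know how often each word occurs.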

optik

4:46 pm on Mar 6, 2009 (gmt 0)

You really need to put this into a function, and the way you've written the foreach loop is a bit wrong.

The loop should just act as a counting and matching mechanism, so you don't need to set constants within it; otherwise they will be set on every iteration or, even worse, reset.

The loop also does not contain a conditional to check for any match.

I think you might be best off sketching this out on paper first to get the general principle in order.
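
One way to read this advice, as a rough sketch only: wrap the weighting in a function, compute the values that are the same for every word once before the loop, and add a conditional for words that have no match in the database. The function name term_weights() and the $doc_freq lookup array (word => number of documents containing it) are assumptions made for illustration; in the original post those numbers come from a database.

```php
<?php
/**
 * Weight every distinct word of one document with the formula from the post:
 * weight = freq / max * log(doc_count / doc_freq).
 *
 * $words      word stems of the document, stop words already removed
 * $doc_count  number of documents in the database
 * $doc_freq   hypothetical lookup: word => number of documents containing it
 */
function term_weights(array $words, $doc_count, array $doc_freq)
{
    $weights = array();
    if (count($words) === 0) {
        return $weights;
    }
    // These values do not change per word, so compute them once.
    $counts = array_count_values($words); // word => occurrences in this document
    $max = max($counts);                  // occurrences of the most frequent word

    foreach ($counts as $word => $freq) {
        // Skip words the lookup has no count for; this also avoids division by zero.
        if (empty($doc_freq[$word])) {
            continue;
        }
        $weights[$word] = $freq / $max * log($doc_count / $doc_freq[$word]);
    }
    return $weights;
}

$weights = term_weights(
    array('cat', 'dog', 'chase', 'cat'),
    10,
    array('cat' => 5, 'dog' => 10, 'chase' => 1)
);
// 'dog' occurs in all 10 documents, so its weight is 0.
```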

d40sithui

4:59 pm on Mar 6, 2009 (gmt 0)

Wow, this sounds complicated. Reminds me of something from neural networks lol. It looks like you will have to use arrays as matrices to represent the data that the LSA algorithm generates. A good grasp of PHP arrays and the LSA algorithm is what you need for this project.
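
To illustrate the "arrays as matrices" idea, here is a small sketch that builds a term-document matrix as a nested PHP array, with one row per term and one column per document. It stores raw counts to keep the example short; the function name build_term_document_matrix() and the sample documents are made up for illustration.

```php
<?php
/**
 * Build a term-document matrix of raw counts.
 * Rows are terms, columns are documents: $matrix[$i][$j] is how often
 * term $i occurs in document $j.
 */
function build_term_document_matrix(array $documents)
{
    // Collect the vocabulary in order of first appearance: word => row index.
    $terms = array();
    foreach ($documents as $words) {
        foreach ($words as $word) {
            if (!isset($terms[$word])) {
                $terms[$word] = count($terms);
            }
        }
    }

    // Start with a matrix full of zeros.
    $matrix = array();
    foreach ($terms as $word => $row) {
        $matrix[$row] = array_fill(0, count($documents), 0);
    }

    // Count every occurrence into its (term, document) cell.
    foreach (array_values($documents) as $col => $words) {
        foreach ($words as $word) {
            $matrix[$terms[$word]][$col]++;
        }
    }
    return array(array_keys($terms), $matrix);
}

list($terms, $matrix) = build_term_document_matrix(array(
    array('cat', 'dog', 'cat'),
    array('dog', 'bird'),
));
// $terms  is array('cat', 'dog', 'bird')
// $matrix is array(array(2, 0), array(1, 1), array(0, 1))
```

The raw counts could be replaced by weights before any further processing; the array layout stays the same.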

DNnetz

1:37 pm on Mar 7, 2009 (gmt 0)

Thank you very much for your answers. I've changed the loop a bit:

After this code ...
[PHP]
$term_document_matrix = array();
foreach ($words as $word) {
    // $freq is the number of occurrences of $word in this document;
    // $max is the number of occurrences of the most frequent word in this document;
    // $doc1 is the number of documents in database;
    // $doc2 is the number of documents in database containing $word;
    $weight = $freq / $max * log($doc1 / $doc2);
    $term_document_matrix[$word] = $weight;
}
[/PHP]
... I should have an array with all words as the keys and the weights as the values.
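
With such a word => weight array, the highest-weighted words can be read off directly. The sketch below shows that step only; it ranks by the plain weights and involves no LSA yet. The function name top_keywords() and the sample weights are assumptions for illustration.

```php
<?php
/**
 * Return the $n highest-weighted words, best first.
 */
function top_keywords(array $weights, $n)
{
    // Sort by weight, descending, keeping the words as keys.
    // $weights is passed by value, so the caller's array is untouched.
    arsort($weights);
    return array_keys(array_slice($weights, 0, $n, true));
}

$keywords = top_keywords(array('cat' => 0.69, 'dog' => 0.0, 'chase' => 1.15), 2);
// $keywords is array('chase', 'cat')
```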

But there's still a lot of work to do then, isn't there?
Wikipedia says that I should use singular value decomposition to split the term-document matrix into three components:
( A = U * S * Vt )
The orthogonal matrices U and V contain the eigenvectors of AAt and AtA respectively; S is a diagonal matrix holding the square roots of the eigenvalues of AtA, also called singular values.
Using the singular values in the matrix S, you can control the linear feature extraction by successively leaving out the smallest singular value until only the k largest remain, where k is a limit you choose yourself.

But my problem is: I don't know how to code what the Wikipedia article describes. Could you please help me find an approach for coding that?
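
PHP has no built-in SVD routine, so one possible starting point is power iteration, a swapped-in technique that is not taken from the Wikipedia article. It repeatedly multiplies a vector by AtA until it settles on the eigenvector with the largest eigenvalue, which gives the largest singular value and its pair of singular vectors. The sketch below is deliberately limited to that single largest triplet. All function names are hypothetical, and the matrix is assumed to be a non-empty nested array with rows indexed from 0, all of the same length.

```php
<?php
/** Multiply matrix $a (rows x cols) by vector $x (length cols). */
function mat_vec(array $a, array $x)
{
    $y = array();
    foreach ($a as $i => $row) {
        $sum = 0.0;
        foreach ($row as $j => $value) {
            $sum += $value * $x[$j];
        }
        $y[$i] = $sum;
    }
    return $y;
}

/** Multiply the transpose of $a by vector $y (length rows). */
function mat_t_vec(array $a, array $y)
{
    $x = array_fill(0, count($a[0]), 0.0);
    foreach ($a as $i => $row) {
        foreach ($row as $j => $value) {
            $x[$j] += $value * $y[$i];
        }
    }
    return $x;
}

/** Euclidean length of a vector. */
function vec_norm(array $v)
{
    $sum = 0.0;
    foreach ($v as $value) {
        $sum += $value * $value;
    }
    return sqrt($sum);
}

/**
 * Power iteration on AtA. Returns array($sigma, $u, $v): the largest singular
 * value, its left singular vector (one entry per term) and its right singular
 * vector (one entry per document).
 */
function dominant_singular_triplet(array $a, $iterations = 100)
{
    // Start vector of all ones; assumed not to be orthogonal to the result.
    $v = array_fill(0, count($a[0]), 1.0);
    for ($k = 0; $k < $iterations; $k++) {
        $v = mat_t_vec($a, mat_vec($a, $v)); // v <- AtA * v
        $length = vec_norm($v);
        if ($length == 0.0) {
            break;                           // zero matrix: nothing to find
        }
        foreach ($v as $j => $value) {
            $v[$j] = $value / $length;       // keep v at unit length
        }
    }
    $u = mat_vec($a, $v);                    // A * v = sigma * u
    $sigma = vec_norm($u);
    if ($sigma > 0.0) {
        foreach ($u as $i => $value) {
            $u[$i] = $value / $sigma;
        }
    }
    return array($sigma, $u, $v);
}

list($sigma, $u, $v) = dominant_singular_triplet(array(
    array(3.0, 0.0),
    array(0.0, 2.0),
));
// $sigma is approximately 3, $u and $v are approximately array(1, 0)
```

Further singular values could in principle be obtained by subtracting sigma * u * vt from the matrix and running the same function again (deflation). For anything beyond small matrices, a tested numerical library is likely the safer choice.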

DNnetz

10:52 pm on Mar 13, 2009 (gmt 0)

Does nobody have an idea?
I've found another description of the steps of the algorithm - detailed and in English! Please look over it, maybe you can help me then and say how to code it:
[c2.com...]

coopster

11:06 pm on Mar 13, 2009 (gmt 0)

I don't know how else to say this but ... if you are having trouble converting the conversational/instructional English text to code, you may have to hire somebody. This project is not for the faint of heart and is going to require some commitment, which tends to equate to time and money. In short, you are not likely to find a quick solution to the issue in a forum -- teaching how to code it might take longer than coding it in and of itself.

DNnetz

11:32 pm on Mar 13, 2009 (gmt 0)

Thanks for your quick reaction. The problem for me is not how to code but how to convert the text into program code. I know the basics of PHP and I can even code quite difficult things but I don't understand the text with the instructions. I have to add Singular Value Decomposition and matrices to my PHP code and I don't know how. My deficiency seems to be the mathematical part ... I really want to commit and I don't want you to code it for me. I'm just looking for help because I have problems coding that.

coopster

10:19 am on Mar 14, 2009 (gmt 0)

Understood, and we are saying the same thing.

coopster said: if you are having trouble converting the conversational/instructional English text to code

DNnetz said: The problem for me is not how to code but how to convert the text into program code.

I'm not questioning your ability to code, not at all. Nor am I questioning your intelligence. Quite the contrary, especially when I noticed the discussion topic. You obviously have both initiative and intelligence.

Let me explain myself a bit more here using personal experience as an analogy. I'm not a mathematician but consider myself competent enough to understand complex expressions. Years ago I decided to write a distance calculator. I did my research and discovered many different formulas were available for calculating distance on a spherical surface. Then I came across a formula and discussion amongst some great minds initiated by a fella that worked for NASA. I found a winner. Next up, converting the formula to code and integrating it with a latitude/longitude database table. It took me quite a few hours to get it properly coded and returning results correctly, including indexing my tables for performance.

The "few hours to get it properly coded" part is where you are at right now. My point in this discussion is that most everybody in the forum works a regular job and volunteers their efforts here when they are able. To evaluate and begin breaking your LSA text down is going to take time. I'm guessing the lack of response to your discussion here is down to one of two reasons -- either readers lack the ability to undertake such a task or they just don't have the time to freely donate their efforts. And knowing the members of this community, I'm banking on the latter reason.

I hope this makes sense, DNnetz.

g1smd

10:52 am on Mar 14, 2009 (gmt 0)

It's also the weekend, and a number of forum participants are M-F/9-5 types.

Lord Majestic

11:09 am on Mar 14, 2009 (gmt 0)

LSA is a very niche topic that very few people work on; those who do are most likely not allowed to say anything on this subject at all due to NDAs.

LSA is hard to scale and PHP isn't exactly the best language for high performance analysis of this kind: don't expect to spend a few hours on it and get results. I also don't think this is the right forum for this sort of question.

If you do this sort of stuff you are essentially on your own and if you want to succeed you will need to get used to it.

Good luck.

DNnetz

12:25 pm on Mar 14, 2009 (gmt 0)

Thank you for your answers!
Now I understand why there's a lack of response. You're right, LSA seems to be a difficult topic and only a few people have coded this.

Again: Thank you for your responses. I'll just try to solve the problem by myself, even if it takes a lot of time ...