Forum Moderators: coopster
That's my little brain teaser challenge for you. Any help much appreciated.
$relevance = 0;
for ($loop = 0; $loop < 100; $loop++) {
    // count positions where the two strings agree
    if ($longstring1[$loop] == $longstring2[$loop]) {
        $relevance++;
    }
}
A perfect match would give you 100% relevance. If only the first half matches, you get 50% relevance.
The answer to your question really depends on how you define relevance. With this method, two strings that are the reverse of each other would result in low relevance, even though they are actually very similar.
This sounds like what you want - but I'm afraid the complexity means that it probably wouldn't be practical to use on a large database.
See the levenshtein() function in the PHP manual for more info.
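levenshtein() isn't PHP-only; the underlying dynamic programme is a few lines in any language. A sketch in Python (my own implementation of the standard algorithm, not PHP's internals):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,              # delete ca
                curr[j - 1] + 1,          # insert cb
                prev[j - 1] + (ca != cb), # substitute (free if they match)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # → 3
```

The complexity is O(N*M), which is why the post above warns it may not be practical across a large database: you would run it once per stored string.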
Compress the first string (use an API to pkzip or some such)
Concatenate the second string to the (original, uncompressed) first string, and compress the combined string.
Compare the two compressed lengths. How much longer the compressed combined string is than the compressed first string is a measure of how different they are.
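pkzip isn't essential; any general-purpose compressor shows the same effect. A rough sketch with Python's zlib (the normalisation by the second string's compressed size is my own choice, not part of the original suggestion):

```python
import zlib

def compressed_len(s: str) -> int:
    return len(zlib.compress(s.encode(), 9))

def compression_distance(s1: str, s2: str) -> float:
    """Extra compressed bytes s2 adds on top of s1, relative to s2's own
    compressed size. Near 0: s2 is mostly redundant given s1.
    Near 1: s2 shares little with s1."""
    added = compressed_len(s1 + s2) - compressed_len(s1)
    return added / max(compressed_len(s2), 1)

x = "the quick brown fox jumps over the lazy dog " * 5
y = "Jackdaws love my big sphinx of quartz, and pack my box too. " * 5
print(compression_distance(x, x))  # small: x adds no new information to itself
print(compression_distance(x, y))  # larger: mostly new content
```

Note this works best on strings long enough that the compressor's fixed header overhead doesn't swamp the signal.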
There is no ideal method for comparing similarities -- it depends on the type of difference. Both grahamstewart's and my suggestion do badly on strings that are the reverse of each other: abcdefghij vs jihgfedcba, although they are obviously very closely related.
An interesting and related article from Scientific American on the topic:
[sciam.com...]
Shame the SI article is subscription - you wouldn't care to summarise it? ;)
This would require some extra thought to implement, but would be extremely efficient, very simple, and very fast - O(1)!
Create an array of hashed strings - get a good hash function that gives a good distribution.
Hash all the strings and store in an array, probably with bucket chaining for collisions.
To compare string1 to string2: hash them both and get an index into the array for each; the distance between the two indices in the hash table is your match probability.
The tricky part of this would be creating a hash function that didn't give you a "random" distribution - I have seen them out on the net, but never used them as usually that's the last thing you want!
A perhaps easier refinement might be to hash-match only a construct of the strings to get a close match, then work the long way round mentioned already. That wouldn't be O(1), but it would certainly still be faster.
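A true hash whose index distance tracks string similarity is hard to build (that's locality-sensitive hashing territory). The "close match first, long way round second" refinement can be sketched more simply: bucket strings by a coarse signature so the expensive comparison only runs within a bucket. The sorted-character-count signature here is just one illustrative choice, not from the thread:

```python
from collections import Counter, defaultdict

def signature(s: str) -> tuple:
    """A coarse, order-insensitive signature: sorted character counts.
    Same signature => worth a full comparison; different => skip."""
    return tuple(sorted(Counter(s).items()))

def bucket(strings):
    """Group candidate strings so the slow comparison is per-bucket only."""
    buckets = defaultdict(list)
    for s in strings:
        buckets[signature(s)].append(s)
    return buckets

b = bucket(["ACGT", "TGCA", "AAAA", "AAAT"])
# "ACGT" and "TGCA" share a bucket; "AAAA" and "AAAT" each get their own
```

The trade-off is exactly the one described above: the prefilter is fast, but anything the signature can't distinguish still needs the long way round.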
Certainly can't see why this wouldn't be possible.
asp
[edit]
An additional thought, just in case this really is about matching DNA sequences: aren't there only 4 letter possibilities - ACGT? If this is the case I would have thought this massively increases the predictive possibilities:
Say: A = 1, C = 2, G = 3, T = 4
.: ACGT = 10
.: AAAA = 4
.: TTTT = 16
P'raps hash a 12 letter combination and use as a lookup table ( 4^12 * 2 bits memory ). Sorry my maths isn't up to much, so not sure where that gets you but it looks interesting!
[/edit]
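If it really is DNA, the 2-bit idea can be sketched directly. Assuming the alphabet is just ACGT (using values 0-3 here rather than the 1-4 above, so the packing is a plain bit shift):

```python
# Pack a DNA string into an integer, 2 bits per base (assumes only ACGT).
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def pack(seq: str) -> int:
    n = 0
    for base in seq:
        n = (n << 2) | CODE[base]  # shift in 2 bits per letter
    return n

print(pack("ACGT"))  # → 27 (binary 00 01 10 11)
```

A 12-letter window packs into 24 bits, so a direct lookup table over all 4^12 = 16,777,216 combinations is entirely feasible in memory.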
There is a great article [perl.com] that explains how to create a very small fast Perl script for creating your own search engine. Even though it is Perl I think this would adapt well to what you are doing as the PHP coding would not be too difficult.
The next logical step is Latent Semantic Indexing [nitle.org], which they use in "protein structure prediction using algorithms adapted from a text search engine".
Hope this helps.
from manual [php.net]
This calculates the similarity between two strings as described in Oliver [1993]. Note that this implementation does not use a stack as in Oliver's pseudo code, but recursive calls, which may or may not speed up the whole process. Note also that the complexity of this algorithm is O(N**3) where N is the length of the longest string. By passing a reference as the third argument, similar_text() will calculate the similarity in percent for you. It returns the number of matching chars in both strings.
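For anyone curious what that recursion looks like, here is a rough Python re-creation of the algorithm the manual describes (find the longest common substring, count it, then recurse on the pieces to its left and right). This is my reading of the description, not PHP's actual source:

```python
def longest_common_substring(a: str, b: str):
    """Return (pos_a, pos_b, length) of the longest common substring."""
    best = (0, 0, 0)
    for i in range(len(a)):
        for j in range(len(b)):
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            if k > best[2]:
                best = (i, j, k)
    return best

def similar_chars(a: str, b: str) -> int:
    """Count matching chars the way similar_text() does."""
    i, j, k = longest_common_substring(a, b)
    if k == 0:
        return 0
    # the match itself, plus whatever matches to the left and right of it
    return k + similar_chars(a[:i], b[:j]) + similar_chars(a[i + k:], b[j + k:])

a, b = "World", "Word"
matched = similar_chars(a, b)
print(matched, 2 * matched * 100 / (len(a) + len(b)))  # → 4 88.88...
```

The percentage formula at the end is the one the manual uses for the by-reference third argument.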
To make the process efficient you probably need to identify some parameters in the strings which can be given values .. and run a program on the entire set of strings once which assigns values for each parameter to each of those strings .. if that can be done, it can avoid going through an entire cycle of matching every string every time you want to do it.
like a search engine assigns relevance/value for each keyword/key phrase, for each page in the index?
aspr1n - I didn't quite get what you said about the predictive possibilities if the number of characters is so limited.
Thanks again everyone
How can I give parameters to something like that?
It depends on how your relevance is measured ..
say the percentage of 0s, 1s and 2s indicates some similarity between strings .. then it could be one of the parameters.
another way, let us say we look at the similarities between substrings, taking 3 consecutive *digits* (for lack of a better word) at a time. For a 100 digit string we will have 98 such substrings (positions 1-2-3, 2-3-4, etc.). When we consider 3 consecutive digits, there can be 27 different variations of 3 digit strings.
if we assume each one of those to be a basic matching parameter .. we have 27 parameters .. now we read each of the 98 substrings and see how many of those match with our 27 parameters.
if this makes sense .. it can be done for 2 digit (which will have 9 parameters) or by taking higher number digit substrings as parameters.
In the end your algo could be matching many things which could have different weight.
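The 98-substrings-against-27-parameters idea is essentially an n-gram profile. A minimal sketch in Python (the example strings are made up):

```python
from collections import Counter

def ngram_profile(s: str, n: int = 3) -> Counter:
    """Count each length-n substring: 98 windows for a 100-digit string."""
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def shared_ngrams(s1: str, s2: str, n: int = 3) -> int:
    """Number of n-gram occurrences the two strings have in common."""
    p1, p2 = ngram_profile(s1, n), ngram_profile(s2, n)
    return sum((p1 & p2).values())  # multiset intersection

print(shared_ngrams("0120120", "0120210"))  # → 2 ("012" and "120" shared)
```

As suggested, the same function with n=2 gives the 9-parameter version, and larger n gives stricter, more specific parameters.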
A friend just mentioned eigenvalues to me. Anyone know anything about them? Off to google it now.
Just as an alternative, take the mountain to mohammed? ;)
If you had an INT representative of the string, you could look up that INT in the database with a simple SELECT query, and then iterate through these with PHP for the right match.
i.e. a string of "123123123", convert it to base 10 -> base_convert(123123123,4,10) -> 112347 -> SELECT * FROM db WHERE int = 112347
or thereabouts....the idea is that you'll have only one or two strings matching a base 10 number, and you can then query the DB for the number to get the strings to compare.
If you used base_convert() and retrieved strings from the DB, you'd know the first characters in each string are the same, so I'd start comparing them by length or from the last character. I guess that's one way to do it :)
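In Python, int() with an explicit base plays the role of base_convert() for building that lookup key (assuming the strings really are base-4 digits, i.e. only 0-3):

```python
# Turn a base-4 digit string into the integer key used in the SELECT above.
def key_for(s: str) -> int:
    return int(s, 4)

print(key_for("123123123"))  # → 112347, matching the example query
```

The key would then go into an indexed INT column, so the database does the coarse match and PHP only compares the handful of strings that share the key.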
I think mcc235 is nearest the mark so far with the 'Vector Space models' (this is what my friend who's a Wellcome Fellow at Oxford Uni and studying gene sequences suggested too).
The basic idea seems to be that you compile all the occurrences of, say, 1-2, 1-1, 1-3 and so on, and then plot a graph detailing the outcome (e.g. 15 lots of 1-2, 8 lots of 1-1, etc.). You then compare the curves of these graphs using the cosine between them. Sounds a bit mind boggling to me, but I may give it a go. Actually what sounds even more mind boggling is doing the same thing, but in three (or more) dimensions. If anyone can point me to any coded examples of this kind of stuff (in any language) I'd be really grateful.
>>>the strings would have to be an exact match
Yes, you'd look up an INT, and 1 or more strings would be returned, since the INT is the same the start of the string will be the same, so you'd check the middle/end or length of the string to find which one you want.
A: [0 1 0 1 0 0 0 1 0 1]
G: [0 1 0 1 1 1 0 1 0 1]
C: [0 0 0 0 0 0 0 1 0 1]
T: [1 1 0 1 0 0 1 1 0 1]
You can then do some standard matrix algebra to calculate the cosine between two vectors, the higher the cosine the closer the match. That way you could sort them by cosine and they would be automatically ranked. This is basically the same as the search engine article [perl.com] I mentioned before, except instead of storing words from a web page we store A, G, C, and T in a multi-dimensional array.
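Using the four vectors above, the cosine calculation is only a few lines (plain Python here rather than matrix-library code):

```python
import math

def cosine(v1, v2):
    """Cosine between two vectors: inner product over the product of norms."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

A = [0, 1, 0, 1, 0, 0, 0, 1, 0, 1]
G = [0, 1, 0, 1, 1, 1, 0, 1, 0, 1]
C = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]
T = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]

print(cosine(A, G))  # higher: A and G overlap in 4 positions
print(cosine(A, C))  # lower: only 2 positions overlap
```

Identical vectors give a cosine of exactly 1, so sorting matches by cosine ranks them from perfect match downward, as described.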
If you see the link I posted before (LSI), there is another link on that page to a layman's description [javelina.cet.middlebury.edu] of LSI. This has a good overview of the whole process. Skip ahead a few pages and read the paragraph on this page [javelina.cet.middlebury.edu] about the fish tank. Might help explain how to map anything in multi dimensions and what singular value decomposition (SVD) does.
Remember the LSI thing is a little complex because it uses SVD, I would suggest just doing the cosine calculation for now then try SVD later. Hope this helps!
The bit I'm confused about is how you find the cosine between the two vectors when they have lots of zeros in them.
I know the article used a separate bit of coding, (the 'piddle' wasn't it?), but I'd like to be able to understand it.
I saw that to get the angle between 2 vectors it's basically
x1 *x2 + y1 *y2
and I'm pretty sure (correct me if I'm wrong) that for a 10 dimensional vector like your example it would just be
x1 * x2 * x3 etc + y1 * y2 * y3 etc
but surely if half the values are zeros it won't work?
Thanks for your help.
The formula for calculating the cosine is this:
cos = ( V1 · V2 ) / ( ||V1|| x ||V2|| )
Where V1 and V2 are our vectors, the double vertical bars indicate the 2-norm, and the · indicates the inner product. You can take the math on faith, or look it up in any book on linear algebra.
Some matrix algebra help can be found here [freespace.virgin.net]. There was a PHP class [crnec.auckland.ac.nz] for matrix algebra, but it's not there today, maybe you could contact the author. I don't think it had everything you need but it was a good start.
You will have to trust matrix algebra from here as all those zeros will not matter. I have written a quick and dirty PHP implementation of the article. When you run a search with only one search word (out of 700 words in the web pages) it creates a search vector with 1 one and 699 zeros but it works fine. Good luck, if you cure cancer or something let me know! ;)
cure cancer - next on my list ;) For now, something a bit more mundane, fraid I can't really say what it is, but your help is very much appreciated.
I'm very interested in the SVD stuff, although I think I'll play around with the cosines for a while first. What I couldn't follow in their explanation of SVD [citeseer.nj.nec.com] was how they got from the three matrices produced by the SVD to the 'revised' matrix called X-hat - they said to just multiply out the three matrices TSD', but I wasn't clear how to do that.
You've already given me enough help, so if you can't face this one don't worry for now!
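On the "multiply out TSD'" question from above: it is ordinary matrix multiplication of the three SVD factors, usually after truncating them to the top k singular values. A sketch with NumPy (the matrix X, the value of k, and all the numbers are purely illustrative):

```python
import numpy as np

# A made-up 4x3 term-document-style matrix, just to show the mechanics.
X = np.array([[1., 0., 1.],
              [0., 1., 1.],
              [1., 1., 0.],
              [0., 0., 1.]])

# Factor X = T * S * D'  (numpy returns D' directly as Dt, and S as a vector s)
T, s, Dt = np.linalg.svd(X, full_matrices=False)

k = 2                                          # keep the top-k singular values
X_hat = T[:, :k] @ np.diag(s[:k]) @ Dt[:k, :]  # "multiply out" the truncated factors

print(np.round(X_hat, 2))  # rank-2 approximation of X: the LSI 'revised' matrix
```

With no truncation (k equal to the full rank) the product reproduces X exactly; LSI's point is that the truncated X-hat smooths away noise while keeping the dominant structure.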
Basically I once worked on a neural network algorithm. You might want to look at something like that. Some of them, even with large inputs, would take only a few seconds to build the map; then you would map str1 and str2 into it and find the distance between them. That would give you a number that could easily be converted into some percentage of "closeness".
Like I said, I've forgotten a bit about it, but if I can find my notebook on this I'll post some of it. But if you really need something acurrate (not bad like my spelling), then I would look into this. There are plenty of example mapping programs on the net (from what I recall).
:) good luck on understanding some of them....