Forum Moderators: coopster

A brain teaser challenge - Comparing 2 long strings

         

musicales

11:11 pm on Jun 16, 2003 (gmt 0)

10+ Year Member



I've got two fairly long strings - imagine, for example, that they were two sections of DNA code, say 100 characters long. Has anyone got any idea how I could compare the two and give a % score for relevance? I'd want the first string to be compared against a database full of long strings, so it needs to be efficient, and the only way I can think of at the moment is to loop through lots of small sections of the string and see how they match - very inefficient.

That's my little brain teaser challenge for you. Any help much appreciated.

marcs

11:17 pm on Jun 16, 2003 (gmt 0)

10+ Year Member



Depends on how relevance is to be measured. A very basic method could be:

$relevance = 0;
for ($i = 0; $i < 100; $i++) {
    if ($longstring1[$i] == $longstring2[$i]) {
        $relevance++;
    }
}

A perfect match would give you 100% relevance. If only the first half matches, you get 50% relevance.

The answer to your question really depends on how you define relevance. With this method, two equal strings that are reversed in order would score very low relevance even though they are actually very similar.

marcs

11:17 pm on Jun 16, 2003 (gmt 0)

10+ Year Member



Guess I actually provided the example you had already considered and know to be ineffective, sorry :)

musicales

11:43 pm on Jun 16, 2003 (gmt 0)

10+ Year Member



marcs - as you say, your method would show no relevance if there was one extra character right at the start, or one missing. I would ideally like to be able to take such small variations into account.

grahamstewart

12:02 am on Jun 17, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Levenshtein Distance: the minimal number of characters you have to replace, insert or delete to transform str1 into str2. The complexity of the algorithm is O(m*n), where n and m are the length of str1 and str2.

This sounds like what you want - but I'm afraid the complexity means that it probably wouldn't be practical to use on a large database.

See the levenshtein() function in the PHP manual for more info.
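For reference, the dynamic program behind levenshtein() looks roughly like this - a Python sketch, with a similarity() wrapper (my own addition, not part of PHP's function) to turn the distance into the percentage score musicales asked for:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimal inserts/deletes/substitutions to turn a into b (O(m*n))."""
    prev = list(range(len(b) + 1))            # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]                             # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,       # delete ca
                           cur[j - 1] + 1,    # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute (free if equal)
        prev = cur
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Distance rescaled to a 0-100% score."""
    return 100.0 * (1 - levenshtein(a, b) / max(len(a), len(b), 1))
```

Unlike the character-by-character loop above, this scores one inserted or deleted character as a single edit rather than a total mismatch.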

musicales

9:08 am on Jun 17, 2003 (gmt 0)

10+ Year Member



many thanks graham - that looks spot on, will look into it.

victor

10:15 am on Jun 17, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Another method which may be relevant:

Compress the first string (use an API to pkzip or some such)

Concatenate the second string to the (original, uncompressed) first string, and compress the combined string.

Compare the two compressed lengths. How much longer the compressed combined string is than the compressed first string is a measure of how different they are.
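Sketched in Python, using zlib in place of pkzip (an assumption; any LZ-style compressor should behave similarly, since a near-duplicate second string compresses almost for free against the first):

```python
import zlib

def compressed_overhead(s1: str, s2: str) -> int:
    """Extra bytes needed to compress s1+s2 beyond compressing s1 alone.
    A small overhead means s2 shares a lot of structure with s1."""
    c1 = len(zlib.compress(s1.encode()))
    c12 = len(zlib.compress((s1 + s2).encode()))
    return c12 - c1

# Illustrative strings (made up for this sketch):
a = "the quick brown fox jumps over the lazy dog " * 5
b = "the quick brown fox jumped over a lazy dog "   # near-duplicate of a
c = "zyxwvutsrqponmlkjihgfedcba 0123456789 !@#$% "  # unrelated
```

Here compressed_overhead(a, b) comes out smaller than compressed_overhead(a, c), which is the measure of difference victor describes.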

There is no ideal method for comparing similarities -- it depends on the type of difference. Both grahamstewart's and my suggestion do badly on strings that are the reverse of each other: abcdefghij vs jihgfedcba, although they are obviously very closely related.

Interesting and related article from Scientific American on the topic:
[sciam.com...]

musicales

11:10 am on Jun 17, 2003 (gmt 0)

10+ Year Member



thanks victor
I guess part of the problem I have is not so much the backwards issue, but that a string in the database might be 100 characters long, whereas the string to compare might be only 10 or 20. Using the Levenshtein distance, which I've looked into (and which is a very big step in the right direction), this would return a difference of 80 or 90, which is not quite right. I would need to do 100 Levenshtein distances starting from each point in the 100-length string - that would be a nightmare!

Shame the SciAm article is subscription-only - you wouldn't care to summarise it? ;)

aspr1n

2:22 pm on Jun 17, 2003 (gmt 0)

10+ Year Member



Ooh, how about this for an off-the-cuff thought - I've done similar stuff in C rather than PHP, but this might just work.

This would require some extra thought to implement, but would be extremely efficient, very simple, and very fast - O(1)!

Create an array of hashed strings - get a good hash function that gives a good distribution.

Hash all the strings and store in an array, probably with bucket chaining for collisions.

To compare string1 to string2 - hash them both and get an index into the array for each; the distance between the two indices in the hash table is your match probability.

The tricky part of this would be creating a hash function that didn't give you a "random" distribution - I have seen them out on the net, but never used them as usually that's the last thing you want!

The perhaps easier refinement might be to only hash-match a construct of the strings to get a close match, then work the long way round as mentioned already - that wouldn't be O(1), but would certainly still be faster.

Certainly can't see why this wouldn't be possible.

asp

[edit]
An additional thought, just in case this really is about matching DNA sequences - aren't there only 4 letter possibilities, ACGT? If this is the case, I would have thought this massively increases the predictive possibilities:

Say: A = 1, C = 2, G = 3, T = 4
.: ACGT = 10
.: AAAA = 4
.: TTTT = 16

P'raps hash a 12-letter combination and use it as a lookup table ( 4^12 entries * 2 bits of memory ). Sorry, my maths isn't up to much, so not sure where that gets you, but it looks interesting!
[/edit]
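aspr1n's 12-letter lookup is essentially the standard 2-bit encoding of DNA k-mers: with only 4 letters, each base needs 2 bits, so a 12-letter window packs into a 24-bit integer that can index a 4^12-entry table directly. A Python sketch (the base-to-number mapping is an arbitrary choice):

```python
# Map each base to 2 bits, so a k-letter window packs into one integer
# that can serve as a direct array index (4**k slots for k-letter windows).
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def pack(kmer: str) -> int:
    v = 0
    for base in kmer:
        v = (v << 2) | CODE[base]   # shift in 2 bits per base
    return v
```

With this mapping, pack("ACGT") is the byte 0b00011011, and a 12-mer always fits in 24 bits.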

mcc235

12:57 pm on Jun 20, 2003 (gmt 0)

10+ Year Member



Have you looked at Vector Space models? I am not sure if this is applicable, but basically you turn each string into a vector, then use singular value decomposition to find the relative distance between strings.

There is a great article [perl.com] that explains how to create a very small fast Perl script for creating your own search engine. Even though it is Perl I think this would adapt well to what you are doing as the PHP coding would not be too difficult.

The next logical step is Latent Semantic Indexing [nitle.org], which they use in "protein structure prediction using algorithms adapted from a text search engine".

Hope this helps.

brotherhood of LAN

1:24 pm on Jun 20, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The levenshtein() function graham mentioned is handy, I used that for a makeshift spellchecker... a similar function is similar_text()

from manual [php.net]

This calculates the similarity between two strings as described in Oliver [1993]. Note that this implementation does not use a stack as in Oliver's pseudo code, but recursive calls which may or may not speed up the whole process. Note also that the complexity of this algorithm is O(N**3) where N is the length of the longest string.

By passing a reference as third argument, similar_text() will calculate the similarity in percent for you. It returns the number of matching chars in both strings.

musicales

8:07 am on Jun 21, 2003 (gmt 0)

10+ Year Member



mcc235 and bol - many thanks - I'll do some work on your suggestions and see what works best.

jaski

8:37 am on Jun 21, 2003 (gmt 0)

10+ Year Member



To me it sounds something like the way Google ranks pages for a given keyword combination.

To make the process efficient, you probably need to identify some parameters in the strings which can be given values, and run a program on the entire set of strings once which assigns values for each parameter to each of those strings. If that can be done, it avoids going through an entire cycle of matching every string every time you want to search.

like a search engine assigns relevance/value for each keyword/key phrase, for each page in the index?

musicales

10:34 am on Jun 21, 2003 (gmt 0)

10+ Year Member



jaski - I've realised that actually what I'm after is only three different characters - up, down and same - so it's actually a long string like 0120200200201212020201212 or whatever. How can I give parameters to something like that?

aspr1n - I didn't quite get what you said about the predictive possibilities if the number of characters is so limited.

Thanks again everyone

jaski

12:29 pm on Jun 21, 2003 (gmt 0)

10+ Year Member



How can I give parameters to something like that?

It depends on how your relevance is measured ..

say the percentage of 0s, 1s and 2s indicates some similarity between strings .. then it could be one of the parameters.

Another way: let's say we look at the similarities between substrings, taking 3 consecutive *digits* (for lack of a better word) at a time. For a 100-digit string we will have 98 such substrings (positions 1-2-3, 2-3-4 etc). When we consider 3 consecutive digits, there can be 27 different variations of 3-digit strings.

if we assume each one of those to be a basic matching parameter .. we have 27 parameters .. now we read each of the 98 substrings and see how many of those match with our 27 parameters.

If this makes sense, it can be done for 2-digit substrings (which gives 9 parameters) or by taking longer substrings as parameters.

In the end your algo could be matching many parameters, each with a different weight.
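jaski's trigram idea in a Python sketch (function names are mine). The key efficiency point is that each record's profile can be computed once, stored, and reused for every later search, rather than rescanning the raw strings:

```python
from collections import Counter

def trigram_profile(s: str) -> Counter:
    """Count every run of 3 consecutive digits - jaski's 27 parameters."""
    return Counter(s[i:i + 3] for i in range(len(s) - 2))

def profile_overlap(s1: str, s2: str) -> float:
    """Fraction of trigrams the two profiles share (0..1)."""
    p1, p2 = trigram_profile(s1), trigram_profile(s2)
    shared = sum((p1 & p2).values())                 # multiset intersection
    total = max(sum(p1.values()), sum(p2.values()), 1)
    return shared / total
```

Because the profiles are just small count tables, comparing a query against a stored profile is cheap even if building all the profiles up front takes a while.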

musicales

12:40 pm on Jun 21, 2003 (gmt 0)

10+ Year Member



jaski
thanks - yes, that's actually exactly how I've been doing things so far. Trouble is, of course, you have 27*98 lookups for just 1 record. As I'd like to be able to search 10,000+ records at Google-style speed, it seems impractical.

A friend just mentioned eigenvalues to me. Anyone know anything about them? Off to Google it now.

brotherhood of LAN

1:46 pm on Jun 21, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



>>I would want the first string to be compared against a database full of long strings, so it needs to be efficient, and the only way I can think at the moment is to loop through lots of small sections of the string and see how they match

Just as an alternative, take the mountain to mohammed? ;)

If you had an INT representative of the string, you could look up that INT in the database with a simple SELECT query, and then iterate through these with PHP for the right match.

i.e. for a string of "123123123", convert it to base 10 -> base_convert("123123123", 4, 10) -> 112347 -> SELECT * FROM db WHERE int = 112347

or thereabouts....the idea is that you'll have only one or two strings matching a base 10 number, and you can then query the DB for the number to get the strings to compare.

If you used base_convert() and retrieved strings from the DB, you'd know the first characters in each string are the same, so I'd start comparing them by length or from the last character. I guess that's one way to do it :)
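For reference, the same base-4 to base-10 conversion bol does with base_convert(), sketched in Python:

```python
# Treat the digit string as a base-4 number, as PHP's base_convert() does,
# giving an integer key that can be indexed in the database.
def string_key(s: str) -> int:
    return int(s, 4)   # raises ValueError if a digit falls outside 0..3
```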

musicales

3:49 pm on Jun 21, 2003 (gmt 0)

10+ Year Member



bol - that's a neat idea! Trouble is, if I understand what you're saying right, the strings would have to be an exact match. My 'string to compare' might be just a little chunk out of the middle of a larger one and so forth - so that wouldn't work, I don't think.

I think mcc235 is nearest the mark so far with the 'Vector Space models' (this is what my friend who's a Wellcome Fellow at Oxford Uni and studying gene sequences suggested too).

The basic idea seems to be that you compile all the occurrences of, say, 1-2, 1-1, 1-3 and so on, and then plot a graph detailing the outcome (e.g. 15 lots of 1-2, 8 lots of 1-1 etc). You then compare the curve of these graphs using the cosine. Sounds a bit mind-boggling to me, but I may give it a go. Actually, what sounds even more mind-boggling is doing the same thing in three (or more) dimensions. If anyone can point me to any coded examples of this kind of stuff (in any language) I'd be really grateful.

brotherhood of LAN

5:19 pm on Jun 21, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Agreed, the vector space model is probably by far the best way, though I can't offer much on it ;) You might want to search for genetic algorithms as well as vector space models. What you described almost sounds like the way you would Huffman-encode a document, i.e. find all the unique characters, count them and re-encode depending on frequency... well, similar-ish ;) Many of the Huffman encoding tutorials I read used A-G-T-C in their examples.

>>>the strings would have to be an exact match

Yes, you'd look up an INT, and 1 or more strings would be returned, since the INT is the same the start of the string will be the same, so you'd check the middle/end or length of the string to find which one you want.

mcc235

2:00 pm on Jun 22, 2003 (gmt 0)

10+ Year Member



This is a poor example, but hopefully it will give you an idea. Let's take a 10-character string where each character can have one of four values (A, G, C, or T) - say "TAGTGGTACT". You would create a 4x10 matrix like this for every string (all the ones in your database as well as your search string), with a 1 marking which letter appears at each position:

A: [0 1 0 0 0 0 0 1 0 0]
G: [0 0 1 0 1 1 0 0 0 0]
C: [0 0 0 0 0 0 0 0 1 0]
T: [1 0 0 1 0 0 1 0 0 1]

You can then do some standard matrix algebra to calculate the cosine between two vectors, the higher the cosine the closer the match. That way you could sort them by cosine and they would be automatically ranked. This is basically the same as the search engine article [perl.com] I mentioned before, except instead of storing words from a web page we store A, G, C, and T in a multi-dimensional array.

If you see the link I posted before (LSI), there is another link on that page to a layman's description [javelina.cet.middlebury.edu] of LSI. This has a good overview of the whole process. Skip ahead a few pages and read the paragraph on this page [javelina.cet.middlebury.edu] about the fish tank. Might help explain how to map anything in multi dimensions and what singular value decomposition (SVD) does.

Remember the LSI thing is a little complex because it uses SVD, I would suggest just doing the cosine calculation for now then try SVD later. Hope this helps!
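One way to see the cosine step concretely: flatten each string to a vector of counts and compare. This Python sketch drops the positional detail of the 4x10 matrix and just counts letters (a deliberate simplification for illustration, not the article's full method):

```python
from math import sqrt

def cosine(v1, v2):
    """Cosine between two equal-length vectors; 1.0 means identical direction."""
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = sqrt(sum(a * a for a in v1))
    n2 = sqrt(sum(b * b for b in v2))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def letter_vector(s, alphabet="AGCT"):
    """One count per letter - a 4-dimensional profile of the string."""
    return [s.count(ch) for ch in alphabet]
```

Sorting database strings by cosine against the query vector gives the automatic ranking mcc235 describes.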

musicales

3:21 am on Jun 23, 2003 (gmt 0)

10+ Year Member



mcc235 - thanks for your help, yes I've been looking into the article you mentioned - it's very good indeed.

The bit I'm confused about is how you find the cosine between the two vectors when they have lots of zeros in them.

I know the article used a separate bit of coding (the 'piddle', wasn't it?), but I'd like to be able to understand it.

I saw that to get the angle between 2 vectors it's basically

x1 *x2 + y1 *y2

and I'm pretty sure (correct me if I'm wrong) that for a 10 dimensional vector like your example it would just be

x1 * x2 * x3 etc + y1 * y2 * y3 etc

but surely if half the values are zeros it won't work?

Thanks for your help.

mcc235

4:04 am on Jun 23, 2003 (gmt 0)

10+ Year Member



You're on the right track. From the article:

The formula for calculating the cosine is this:

cos = ( V1 · V2 ) / ( ||V1|| × ||V2|| )

where V1 and V2 are our vectors, the double vertical bars indicate the 2-norm, and the · indicates the inner product. You can take the math on faith, or look it up in any book on linear algebra.

Some matrix algebra help can be found here [freespace.virgin.net]. There was a PHP class [crnec.auckland.ac.nz] for matrix algebra, but it's not there today, maybe you could contact the author. I don't think it had everything you need but it was a good start.

You will have to trust matrix algebra from here as all those zeros will not matter. I have written a quick and dirty PHP implementation of the article. When you run a search with only one search word (out of 700 words in the web pages) it creates a search vector with 1 one and 699 zeros but it works fine. Good luck, if you cure cancer or something let me know! ;)
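The zeros really are harmless: a zero component adds nothing to either the inner product or the squared norms, so only the overlapping nonzero positions matter. A quick Python sanity check of the 1-one-in-700 case (the nonzero positions here are made up for illustration):

```python
from math import sqrt

def cosine(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = sqrt(sum(a * a for a in v1))
    n2 = sqrt(sum(b * b for b in v2))
    return dot / (n1 * n2) if n1 and n2 else 0.0

# 700-dimensional search vector: one 1 and 699 zeros.
query = [0] * 700
query[42] = 1
doc = [1 if i % 7 == 0 else 0 for i in range(700)]  # 100 nonzero entries
score = cosine(query, doc)  # zeros contribute nothing to dot product or norms
```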

musicales

8:02 am on Jun 23, 2003 (gmt 0)

10+ Year Member



I'm not quite clear what the 2-norm is here. If it's the same as on the algebra page you showed, this would mean I have to take the square root of (x*x + y*y ... etc) for every vector, which, to do 10,000 times for one search, would surely be very resource-intensive?

cure cancer - next on my list ;) For now, something a bit more mundane, fraid I can't really say what it is, but your help is very much appreciated.

musicales

8:51 am on Jun 23, 2003 (gmt 0)

10+ Year Member



OK I take the last post back - I've just done a test on a 25 dimension vector, searching 10,000 rows and it took about 3 seconds - that'll do me fine!

I'm very interested in the SVD stuff, although I think I'll play around with the cosines for a while first. What I couldn't follow in their explanation of SVD [citeseer.nj.nec.com] was how they got from the three matrices produced by the SVD to the 'revised' matrix called X-hat - they said to just multiply out the three matrices TSD', but I wasn't clear how to do that.

You've already given me enough help, so if you can't face this one don't worry for now!
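On the "multiply out TSD'" step: it is just two ordinary matrix products - T times S times the transpose of D. A pure-Python sketch of that arithmetic (in practice a linear algebra library would do this, and S usually arrives as a diagonal matrix of singular values):

```python
def transpose(m):
    """Swap rows and columns - this is the ' (prime) in D'."""
    return [list(row) for row in zip(*m)]

def matmul(a, b):
    """Ordinary matrix product: entry (i, j) is row i of a dotted with column j of b."""
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def reconstruct(T, S, D):
    """X-hat = T * S * D', per the LSI notation quoted above."""
    return matmul(T, matmul(S, transpose(D)))
```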

mcc235

9:59 am on Jun 23, 2003 (gmt 0)

10+ Year Member



musicales - I don't think I will be too much help from here, you are about up to speed where I am. I only found the mentioned articles a few weeks ago, so I am learning as well. I am slowly working through the SVD stuff, but haven't had too much time to spend on it. Perhaps if you have some success you could share some ideas back through StickyMail. If I find anything else I will let you know. Good luck!

musicales

10:27 am on Jun 23, 2003 (gmt 0)

10+ Year Member



OK thanks - looking forward to your cure for cancer!

mikejson

4:37 pm on Jun 26, 2003 (gmt 0)

10+ Year Member



Heh, I've never posted in here before, but here goes my first one... and a crazy thought. I forget how efficient it is...

Basically I worked on a neural network algorithm; you might want to look at something like that. Some of them, even with large numbers, take only a few seconds to build the map; then you would map str1 and str2 into it and find the distance between them. That would give you a number that could easily be converted into some percentage "closeness".

Like I said, I've forgotten a bit about it, but if I can find my notebook on this I'll post some of it. But if you really need something accurate (not bad like my spelling), then I would look into this. There are plenty of example mapping programs on the net (from what I recall).

:) good luck on understanding some of them....