Forum Moderators: coopster & phranque

Message Too Old, No Replies

Comparing number sets to an original set

how to?

         

Damian

9:37 pm on Jul 27, 2003 (gmt 0)

10+ Year Member



I want to write a script that compares sets of numbers, as in search engine positions.
The original set starts at the number one and increments one each line...simply from 1 to x.
I never understood maths while still in school. .. and now I'd like to know how close each set of numbers is to the original set and I'd appreciate suggestions on how to calculate that...

Here's are some example datasets, the first one is the original set, the other sets should be compared to the original set. All sets have the same amount of records, 10 in this example

1 2 3 4 5 6 7 8 9 10 #original set

1 3 4 2 5 7 6 8 9 10 #set 1
1 10 4 3 5 6 7 9 8 2 #set 2
2 3 4 5 6 7 8 9 10 1 #set 3

Set 1 has 5 numbers wrong
Set 2 has 5 numbers wrong
Set 3 has 10 numbers wrong

Set 3 has all the numbers wrong, but it's sequence has only one mistake. Record number 1 is probably a 'mistake' in the dataset and I wish to account for that possibility in my formula. Had record number 1 not been an exception then set 3 would have had a perfect score.

I think set 3 is closest to my original set, so I should probably look at the way the numbers increment, but that would also have to be done intelligently.. to recognize that 8-7-9 is not sequential but closer to sequential then 1-9-4

Any suggestions on how to put one number on each set which describes how close it is to the original set?
I'm posting this in the perl forum because I'm hoping there are modules/tricks to do this .. but just ideas for an approach would be great to.

Fischerlaender

10:41 am on Jul 28, 2003 (gmt 0)

10+ Year Member



I think you should change your definition of what makes a set "wrong". Instead of comparing the contents of the ten slots, you should count the numbers of basic operations that leaad from one set to another.

Basic operations would be the permutation of two elements or a shift of the whole data structure.

With such an definition you would get a more useful "similarity metric":
1 2 3 4 5 6 7 8 9 10 #original set
1 3 4 2 5 7 6 8 9 10 #set 1
1. swap slots 2 and 4
2. swap slots 3 and 4
3. swap slots 6 and 7
=> 3 basic operations

1 10 4 3 5 6 7 9 8 2 #set 2
1. swap slots 2 and 10
2. swap slots 3 and 4
3. swap slots 8 and 9
=> 3 basic operations

2 3 4 5 6 7 8 9 10 1 #set 3
1. shift the complete list one position to the left
=> 1 basic operation

I'm sure there are some algorithms out there to calculate the operations to get from your original set to your changed sets, but I don't know them.

killroy

10:59 am on Jul 28, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Just calculate average error for a decent similarity measure.

for each number get the absolute difference in position.

so
sum(for each number{
dif=absolute(position1-position2)
})/10

then add them up and div them by 10

that should give you a good error measure.

so set 1 to set 2:
1 3 4 2 5 7 6 8 9 10 #set 1
1 10 4 3 5 6 7 9 8 2 #set 2

abs(1-1)=0
abs(3-10)=7
abs(4-4)=0
abs(2-3)=1
abs(5-5)=0
abs(7-6)=1
abs(6-7)=1
abs(8-9)=1
abs(9-8)=1
abs(10-2)=8
sum=20/10=2

average error=2

so set 1 to set 3:
1 3 4 2 5 7 6 8 9 10 #set 1
2 3 4 5 6 7 8 9 10 1 #set 3

abs(1-2)=1
abs(3-3)=0
abs(4-4)=0
abs(2-5)=3
abs(5-6)=1
abs(7-7)=0
abs(6-8)=2
abs(8-9)=1
abs(9-10)=1
abs(10-1)=9
sum=18/10=1.8

average error=1.8

so set 2 to set 3:
1 10 4 3 5 6 7 9 8 2 #set 2
2 3 4 5 6 7 8 9 10 1 #set 3

abs(1-2)=1
abs(10-3)=7
abs(4-4)=0
abs(3-5)=2
abs(5-6)=1
abs(6-7)=1
abs(7-8)=1
abs(9-9)=0
abs(8-10)=2
abs(2-1)=1
sum=16/10=1.6

average error=1.6

Makes sense?

SN

<ADDED>Just mentioned this method cos it's much easier to implement as amatrix based diff algorithm, unless of course you already have an implementation. I first learned it in a genetics credit and it took me ages to get my head around it. The good old average error measure is accurate and quick to implement.</ADDED>

[edited by: killroy at 11:03 am (utc) on July 28, 2003]

Fischerlaender

11:02 am on Jul 28, 2003 (gmt 0)

10+ Year Member



There's another idea which could help you. You could interpret your data set as a string and do a comparison by trigrams.

Trigrams are three successive characters. So the string "similarity" would consist of these trigrams:
sim, imi, mil, ila, lar, ari, rit, ity, tys, ysi

You can now define the similarity of two given strings by comparing the number of identical trigrams.

Applied to your situation, you could try something similiar. Every number in your set has to be regarded as a character. You can then build your trigrams and count the identical ones. You should do some experiments to find out what definiton of similarity is best for your purpose; here I showed two definitions: identical matches and macthes with permutations.
1 2 3 4 5 6 7 8 9 10 #original set
1-2-3, 2-3-4, 3-4-5, 4-5-6, 5-6-7, 6-7-8, 7-8-9, 8-9-10, 9-10-1, 10-1-2

1 3 4 2 5 7 6 8 9 10 #set 1
1-3-4, 3-4-2, 4-2-5, 2-5-7, 5-7-6, 7-6-8, 6-8-9, 8-9-10, 9-10-1, 10-1-3
=> 2 identical match (8-9-10, 9-10-1)
=> 5 matches with permutations (3-4-2, 5-7-6, 7-6-8, 8-9-10, 9-10-1)

1 10 4 3 5 6 7 9 8 2 #set 2
1-10-4, 10-4-3, 4-3-5, 3-5-6, 5-6-7, 6-7-9, 7-9-8, 9-8-2, 8-2-1, 2-1-10
=> 1 identical match
=> 4 matches with permutations

2 3 4 5 6 7 8 9 10 1 #set 3
2-3-4, 3-4-5, 4-5-6, 5-6-7, 6-7-8, 7-8-9, 8-9-10, 9-10-1, 10-1-2, 1-2-3
=> 10 identical matches
=> 10 matches with permutations

This should give you enough ideas to find your own algorithm for your problem.

claus

11:06 am on Jul 28, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I'm not sure what meets the "wrong" criteria: is it 1) wrong placement of number, or 2) wrong number?

1) is a little trickier than 2). For the latter you just have to run ten sets of "=" and sum up the success rate.

1) requires a fixed set of options (numbers). Basically you are looking for the "mastermind" algo. I don't know the name, it's my own idea of a search term - i would think that there were some in the public domain, it's the "Master Mind" board game you're essentially doing. I programmed one for a pocket calculator once, it was just fun, and not a very efficient algo.

/claus

<added>it did the trick though. Above, efficient relates to scalability only, this one was a one-play-one-player.</added>

Damian

11:47 am on Jul 28, 2003 (gmt 0)

10+ Year Member



Wow, those are really great ideas... Thanks all!

I'll have to experiment with your suggestions before I can decide what would be best for my script. They all sound like they could be good approaches..

>what meets the "wrong" criteria: is it 1) wrong placement of number, or 2) wrong number?
Option 1, all numbers exist in each set, only once, but they can be on the wrong position
A nice comparison with 'Mastermind', but I think it's a little less complicated because Mastermind allows for numbers(colors) to occur twice if I remember it right.

Just to clarify what I have in mind

I basically want to run automated queries on my own simulated search engine....see how the results are ranked and compare the set to the original set retrieved from a real search engine.
The better the match.. the closer my own ranking algo would be the algo of the search engine..
It needs to be very scalable and implemented in a self learning system, the machine will have to try as many combinations as fast as possible and try to improve it's own best score.. :)

killroy

11:58 am on Jul 28, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I see where you'Re coming from. reverse engineering google, aren't we? ;)

Well, you could run several distance measures and combine them too.
Any of these are quite fast in a tight implementation, and can be applied 1000s of times per second, even on a larger sample set.

Mathematically speaking the 2nd one would be fastest, especially on a fixed size set as you could even avoid loops and do most of it in-register:

(abs(set1[1]-set2[1])+abs(set1[2]-set2[2])
+abs(set1[3]-set2[3])+abs(set1[4]-set2[4])
+abs(set1[5]-set2[5])+abs(set1[6]-set2[6])
+abs(set1[7]-set2[7])+abs(set1[8]-set2[8])
+abs(set1[9]-set2[9])+abs(set1[10]-set2[10]))/10

SN

claus

12:37 pm on Jul 28, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Now that sounds like an interesting project :) :) Re. Mastermind: You could in fact have one color (= digit in this case) duplicated any number of times =<max but limiting this just makes things easier.

Killroys method is well known and widely used. It produces results comparable to the concept of Variance, which is nice. You could go for the statistic version and compute it like this:

X = observed number
Y = reference number (expected number)

Variance(*) = SUM (X - Y)2

Then, use the Standard Deviation (**) as ranking measure if you want to get an idea of the "average deviation" - otherwise, just use the variance:

STD = SQRT (Var)

- i'm sure that modules computing standard deviation and variance are readily available.

/claus


(*) Note: This can be rewritten in a number of ways, eg. including probabilities, but those would sum up to 1, so they're omitted, as they were adding no extra information.
(**) Note: The square root of the variance

<edit>
LOL@below - that's exactly what makes WW such a nice place ;) I support killroys approach, the above was just "adding a perspective" - which wasn't needed and was comparatively costly in computation anyway. Won't delete the post though, someone might find it useful at another occasion.
</edit>

[edited by: claus at 1:26 pm (utc) on July 28, 2003]

killroy

12:49 pm on Jul 28, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I jsut used abs as it comes to the same without using costly floating point ops.

The above I can code up in a few lines of ASM and make it run circles around anything.

;) sorry, old school rearing it's ugly face.

SN

Damian

12:48 pm on Aug 2, 2003 (gmt 0)

10+ Year Member



Finally had some time to do some tests with this.. here's what I found.
I made a little script which outputs the results of the methods described here and some others (sticky me if you want a copy).. I found using Algorithm::Diff an easy way to use the 'diff' function, which is basically what Fischerlaender describes in post #2 I think.

Playing with module Algorithm::Diff
I noticed the function 'LCS' and it looks like this is exactly what I need.

From the Algorithm::Diff man page:


... you want to find the longest sequence of items that is present in both original sequences in the same order. That is, you want to find a new sequence S which can be obtained from the first sequence by deleting some items, and from the secend sequence by deleting other items. You also want S to be as long as possible.

This ensures that with original set
1 2 3 4 5 6 7 8 9 10

1 2 3 4 5 6 7 8 10 9 gets the same score as the following set
2 1 3 4 5 6 7 8 9 10 when compared to the original set

'diff' also works reasonably well.. but does not show the difference between
2 3 4 5 6 7 8 9 10 1
and
10 2 3 4 5 6 7 8 9 1

LCS ranks the first better then the second.

> interesting project
hehe .. :) And did someone mention Google?