I've done a little more research and PHP has array compares that might work. I am going to have to tinker a bit to find a better answer. Just very surprised something doesn't leap out at me when I try to Google for previous work.
I'm comparing long blocks of text from a database - basically allowing users to see where the version they have created differs from an existing version in the database.
Can't you just use the UNIX "diff" command?
A good start, but with some problems:
- "diff" does a line-by-line comparison, which is better than nothing, but I want to analyze large chunks of text where perhaps one word in every paragraph has changed and the paragraphs are 200 words long. With "diff", a one-word change marks the whole paragraph as changed, so the entire text ends up flagged as "diff"erent, which is not exactly optimal.
- I haven't figured out how to process the output reliably. The text itself will contain numbers and greater-than and less-than signs, which makes it hard to parse the output with regex (for example, I can't use "<.*" to find deletions) and then reuse it, for example to build search strings and wrap each match as <span class="highlight">search string</span>.
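One possible way around the parsing problem is to never parse text output at all: if the comparison hands back a structured list of operations, the markup can be built directly, with every word escaped so that stray < and > in the text cannot be mistaken for markers. A minimal sketch, assuming a hypothetical list of [operation, word] pairs as input; the function name render_ops and the "removed" class are made up for illustration, and only the "highlight" class comes from the post above:

```php
<?php
// Hypothetical input format: [operation, word] pairs where
// '=' means unchanged, '-' means removed, '+' means added.
function render_ops(array $ops): string
{
    $out = [];
    foreach ($ops as [$op, $word]) {
        // Escape first, so < and > inside the text become &lt; and &gt;.
        $safe = htmlspecialchars($word, ENT_QUOTES, 'UTF-8');
        if ($op === '-') {
            $out[] = '<span class="removed">' . $safe . '</span>';   // assumed class name
        } elseif ($op === '+') {
            $out[] = '<span class="highlight">' . $safe . '</span>';
        } else {
            $out[] = $safe;
        }
    }
    return implode(' ', $out);
}

$html = render_ops([['=', 'if'], ['-', 'a<b'], ['+', 'a>b'], ['=', 'then']]);
```

Because the escaping happens before the spans are added, the only literal angle brackets left in $html belong to the markup itself.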
PHP has array compares that might work
I don't really see how. Consider the following:
$str1 = "The cat ate the dog";
$str2 = "The dog ate the cat";
$arr1 = explode(" ", $str1);
$arr2 = explode(" ", $str2);
$diff = array_diff($arr1, $arr2);
What's in $diff? Answer: nothing. There are no differences from the point of view of PHP array comparison. Of course, for the dog and the cat it makes a big difference.
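To make that concrete, here is a short sketch comparing array_diff() with array_diff_assoc(), which also checks keys (word positions here). The variable names are only for illustration. The position-aware version catches the swap, but it falls apart as soon as a word is inserted, because every later word shifts by one:

```php
<?php
$arr1 = explode(" ", "The cat ate the dog");
$arr2 = explode(" ", "The dog ate the cat");

// array_diff() only asks "does this value occur anywhere in the other array?"
$byValue = array_diff($arr1, $arr2);            // empty array

// array_diff_assoc() compares key and value together, so position matters.
$byPosition = array_diff_assoc($arr1, $arr2);   // [1 => "cat", 4 => "dog"]

// One inserted word shifts everything after it, so almost every
// position is reported as different.
$shifted = array_diff_assoc(
    explode(" ", "The cat ate the dog"),
    explode(" ", "The big cat ate the dog")
);                                              // keys 1, 2, 3 and 4 all differ
```

So neither function is a real text diff: one ignores order entirely, the other cannot resynchronize after an insertion or deletion.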
You have to work your way through the strings.
Assume version 1 and versions 2a and 2b:
1. The dog ate the cat
2a. The lion ate the cat
2b. The wolf ate the cat
- You iterate through 2a and 2b word by word; at the second word you hit a difference and you stop.
- Take 2a as your model and search 2b for "lion" - not found.
- Take 2b as your model and search 2a for "wolf" - not found.
- Take 2a as your model and search 2b for the next word, "ate" - match.
- Continue testing - is it a real match, i.e. do the following words line up too? Yes.
- Done. The difference: "lion" and "wolf" don't match.
That's pretty complicated already, and this is about the simplest case there is.
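For what it's worth, the manual search-and-resync walk above is not how diff-style tools usually go about it; they are built around finding a longest common subsequence (LCS) of the two inputs. Below is a minimal word-level sketch of that idea in PHP, not a tuned implementation. The function name word_diff and the [operation, word] output format are made up for illustration, and the table needs memory proportional to the product of the two word counts, so it only suits texts of modest length:

```php
<?php
// Hypothetical helper: word-level diff via a longest-common-subsequence table.
// Returns [operation, word] pairs: '=' unchanged, '-' only in $old, '+' only in $new.
function word_diff(string $old, string $new): array
{
    $a = preg_split('/\s+/', trim($old), -1, PREG_SPLIT_NO_EMPTY);
    $b = preg_split('/\s+/', trim($new), -1, PREG_SPLIT_NO_EMPTY);
    $n = count($a);
    $m = count($b);

    // $lcs[$i][$j] = length of the LCS of $a[$i..] and $b[$j..]
    $lcs = array_fill(0, $n + 1, array_fill(0, $m + 1, 0));
    for ($i = $n - 1; $i >= 0; $i--) {
        for ($j = $m - 1; $j >= 0; $j--) {
            $lcs[$i][$j] = ($a[$i] === $b[$j])
                ? $lcs[$i + 1][$j + 1] + 1
                : max($lcs[$i + 1][$j], $lcs[$i][$j + 1]);
        }
    }

    // Walk both word lists from the start, following the table.
    $ops = [];
    $i = 0;
    $j = 0;
    while ($i < $n && $j < $m) {
        if ($a[$i] === $b[$j]) {
            $ops[] = ['=', $a[$i]];
            $i++;
            $j++;
        } elseif ($lcs[$i + 1][$j] >= $lcs[$i][$j + 1]) {
            $ops[] = ['-', $a[$i++]];   // dropping the old word loses nothing
        } else {
            $ops[] = ['+', $b[$j++]];   // otherwise take the new word
        }
    }
    while ($i < $n) { $ops[] = ['-', $a[$i++]]; }
    while ($j < $m) { $ops[] = ['+', $b[$j++]]; }
    return $ops;
}

$ops = word_diff("The lion ate the cat", "The wolf ate the cat");
// $ops: = The, - lion, + wolf, = ate, = the, = cat
```

Unlike array_diff(), this respects word order, so "The cat ate the dog" versus "The dog ate the cat" comes back with removals and additions rather than nothing.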
See Imre Simon, Sequence Comparison: Some Theory and Some Practice (1988) [citeseer.nj.nec.com]
There's a wiki page on Diff Algorithm [c2.com]
If you want, you can look at the source code for GNU diff [cvs.sourceforge.net]
Tom
The documentation on Algorithm::Diff [search.cpan.org] includes all the source code [search.cpan.org]. Now if only I knew Perl... time to learn.
Tom
Currently returns output like:
This
word
is
<span class="B">already</span>
<span class="A">no</span>
<span class="A">longer</span>
in
the
string.
I think I'll need to send the output to a file (unless Perl has a method for capturing standard output) and then parse it.
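If the parsing ends up on the PHP side, the one-token-per-line output above is fairly easy to pick apart. A rough sketch, assuming the output always looks exactly like the sample (one word per line, changed words wrapped in a single span); the function name parse_marked_output is made up, and what classes "A" and "B" mean is left to the caller. As far as capturing goes, PHP's shell_exec() returns a command's standard output as a string, so a temporary file may not be needed:

```php
<?php
// Hypothetical helper: turn the one-token-per-line output into [class, word] pairs.
// Unmarked words get a null class.
function parse_marked_output(string $output): array
{
    $tokens = [];
    foreach (preg_split('/\R/', $output, -1, PREG_SPLIT_NO_EMPTY) as $line) {
        $line = trim($line);
        if ($line === '') {
            continue;
        }
        if (preg_match('~^<span class="([^"]+)">(.*)</span>$~', $line, $m)) {
            $tokens[] = [$m[1], $m[2]];   // marked word: [class, word]
        } else {
            $tokens[] = [null, $line];    // unchanged word
        }
    }
    return $tokens;
}

$output = <<<'TXT'
This
word
is
<span class="B">already</span>
<span class="A">no</span>
<span class="A">longer</span>
in
the
string.
TXT;

$tokens = parse_marked_output($output);
```

This only works because the spans come from the script itself; if the compared text can contain its own markup, it would need escaping before the spans are added.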
It might take a while though - yesterday I still didn't know what the assignment operator was in Perl, so I'm starting from -273 Celsius ;-)
Tom