Forum Moderators: coopster


detecting webpage changes / differences

trying to find existing php script

         

amznVibe

12:50 am on Jan 31, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I was trying to write my own code to display the differences between two webpages, but I am having great difficulty. I also cannot seem to find such an animal on the web; it's incredibly hard to search for without concise terminology.

Anyone have any ideas or seen such a thing?

ergophobe

1:37 am on Jan 31, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I've been thinking about the same thing (comparing two text files).

Have you looked at anything like, say, the source code for CVS (that's open source isn't it?) to see what algorithms they use?

henry0

12:19 pm on Jan 31, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Try UE. It has a good file-compare system.

dcrombie

4:52 pm on Jan 31, 2004 (gmt 0)



Can't you just use the UNIX "diff" command? I would grab the files using "curl" or "wget", then call diff and work out which options give the best results.

amznVibe

5:17 pm on Jan 31, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Well, let's go back to the little details: it's webpages that need to be compared, so all HTML and script would need to be stripped before comparison. And it's being done from PHP, so while shell access is possible, it might be overkill.
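A minimal sketch of that stripping step, assuming the page has already been fetched into a string. The name page_to_text is made up for illustration; preg_replace(), strip_tags() and html_entity_decode() are standard PHP functions. Regex-based cleanup like this is only a rough approximation of real HTML parsing.

```php
<?php
// Hypothetical helper: reduce an HTML page to plain text for comparison.
function page_to_text(string $html): string
{
    // strip_tags() removes tags but keeps the text between them, so the
    // bodies of script and style elements have to be dropped first.
    $html = preg_replace('#<(script|style)\b[^>]*>.*?</\1\s*>#is', ' ', $html);
    // Turn the end of common block elements into line breaks so that
    // paragraphs do not run together once the tags are gone.
    $html = preg_replace('#</(p|div|h[1-6]|li|tr)\s*>|<br\s*/?>#i', "\n", $html);
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES, 'UTF-8');
    // Collapse runs of spaces and tabs, trim each line, drop empty lines.
    $lines = array_map('trim', explode("\n", preg_replace('/[ \t\r]+/', ' ', $text)));
    return implode("\n", array_filter($lines, 'strlen'));
}
```

The result has one block of text per line, which is a convenient shape for any line-based comparison afterwards.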

I've done a little more research and PHP has array compares that might work. I am going to have to tinker a bit to find a better answer. Just very surprised something doesn't leap out at me when I try to Google for previous work.

jatar_k

8:18 pm on Jan 31, 2004 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



You could do as dcrombie said and use one of the PHP Program Execution Functions [ca.php.net] to run diff.
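A rough sketch of that idea, assuming a Unix-like host where "diff" is on the PATH and shell_exec() has not been disabled. The function name diff_texts is hypothetical; tempnam(), escapeshellarg() and shell_exec() are standard PHP.

```php
<?php
// Hypothetical helper: diff two strings with the system "diff" command.
// Assumes a Unix-like host with diff on the PATH and shell_exec() enabled.
function diff_texts(string $old, string $new): string
{
    $a = tempnam(sys_get_temp_dir(), 'old');
    $b = tempnam(sys_get_temp_dir(), 'new');
    file_put_contents($a, $old);
    file_put_contents($b, $new);
    // escapeshellarg() keeps odd characters in the paths away from the shell.
    $out = shell_exec('diff ' . escapeshellarg($a) . ' ' . escapeshellarg($b));
    unlink($a);
    unlink($b);
    // An empty string means no differences were reported (or diff did not run).
    return (string) $out;
}
```

The two strings could come from file_get_contents() on each URL if allow_url_fopen is on, or from curl/wget as suggested earlier in the thread.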

ergophobe

8:42 pm on Jan 31, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I don't know whether I should confuse this thread with my concerns which are different from amznVibe's, but...

I'm comparing long blocks of text from a database - basically allowing users to see where the version they have created differs from an existing version in the database.


Quoting dcrombie: Can't you just use the UNIX "diff" command?

A good start, but with some problems

- "diff" does a line-by-line comparison, which is better than nothing, but I want to analyze large chunks of text where perhaps one word has changed in every 200-word paragraph. When a paragraph is a single line, "diff" flags every paragraph, and so the entire text, as "diff"erent, which is not exactly optimal.

- I haven't figured out how to process the output reliably. The text itself will contain numbers and greater-than and less-than signs, which makes it hard to parse the output with a regex (for example, I can't use "<.*" to find deletions). That in turn makes it hard to reuse the output, for example to build search strings and wrap each match in <span class="highlight">search string</span>.
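One possible workaround for the line-by-line limitation, sketched here as an assumption rather than anything proposed in the thread, is to rewrite each text with one word per line before handing it to "diff". The helper name words_per_line is made up.

```php
<?php
// Hypothetical helper: one word per line, so a line-based diff compares words.
function words_per_line(string $text): string
{
    // Split on any run of whitespace and discard empty pieces.
    $words = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    return implode("\n", $words) . "\n";
}
```

In diff's default output every content line starts with "< " or "> ", so checking only the first two characters of each line should avoid confusion with angle brackets inside the text itself.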


Quoting amznVibe: PHP has array compares that might work

I don't really see how. Consider the following:

$str1 = "The cat ate the dog";
$str2 = "The dog ate the cat";

$arr1 = explode(" ", $str1);
$arr2 = explode(" ", $str2);

$diff = array_diff($arr1, $arr2);

What's in $diff? Answer: nothing. There are no differences from the point of view of PHP array comparison. Of course, for the dog and the cat it makes a big difference.
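To make the point concrete, here is a small sketch comparing array_diff() with array_diff_assoc(), which also checks the index. Both are standard PHP functions; the variable names are just for illustration. The positional version catches the swap, but it falls apart as soon as a word is inserted.

```php
<?php
$arr1 = explode(' ', 'The cat ate the dog');
$arr2 = explode(' ', 'The dog ate the cat');

// array_diff() only asks "does this value occur anywhere in $arr2?"
$byValue = array_diff($arr1, $arr2);          // empty array

// array_diff_assoc() also compares the index, so the swap shows up...
$byPosition = array_diff_assoc($arr1, $arr2); // [1 => 'cat', 4 => 'dog']

// ...but one inserted word shifts every later index, so everything
// after the insertion is reported as different.
$arr3 = explode(' ', 'The big cat ate the dog');
$shifted = array_diff_assoc($arr3, $arr1);
```

So neither function understands sequences; they only compare sets of values or fixed positions.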

You have to work your way through the strings.

Assume version 1 and versions 2a and 2b

1. The dog ate the cat
2a. The lion ate the cat
2b. The wolf ate the cat

- you iterate through and get to the second word, you have a difference and you stop.
- you take 2a as your model and you search 2b for "lion" - not found
- take 2b as your model and search 2a for "wolf" - not found
- take 2a as your model and search 2b for "ate" - match.
- continue testing - is it a real match? Yes.
- Done. The difference: "lion" and "wolf" don't match.

That's pretty complicated already and we've only scratched about the simplest case there is.
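For comparison, here is a sketch of a different, textbook approach: a longest-common-subsequence (LCS) table over words. It is not the search-ahead procedure described in the steps above, and it is not claimed to be what GNU diff does internally. The function name word_diff and the '=', '-', '+' markers are made up for illustration.

```php
<?php
// Hypothetical helper: word-level diff via a longest-common-subsequence table.
// Returns [op, word] pairs; op is '=' (kept), '-' (only in $a), '+' (only in $b).
function word_diff(array $a, array $b): array
{
    $n = count($a);
    $m = count($b);
    // $lcs[$i][$j] = length of the LCS of $a[$i..] and $b[$j..]
    $lcs = array_fill(0, $n + 1, array_fill(0, $m + 1, 0));
    for ($i = $n - 1; $i >= 0; $i--) {
        for ($j = $m - 1; $j >= 0; $j--) {
            $lcs[$i][$j] = ($a[$i] === $b[$j])
                ? $lcs[$i + 1][$j + 1] + 1
                : max($lcs[$i + 1][$j], $lcs[$i][$j + 1]);
        }
    }
    // Walk the table from the start, emitting words in reading order.
    $out = [];
    $i = $j = 0;
    while ($i < $n && $j < $m) {
        if ($a[$i] === $b[$j]) {
            $out[] = ['=', $a[$i]];
            $i++;
            $j++;
        } elseif ($lcs[$i + 1][$j] >= $lcs[$i][$j + 1]) {
            $out[] = ['-', $a[$i++]];
        } else {
            $out[] = ['+', $b[$j++]];
        }
    }
    // Whatever is left over on either side is a pure deletion or insertion.
    while ($i < $n) {
        $out[] = ['-', $a[$i++]];
    }
    while ($j < $m) {
        $out[] = ['+', $b[$j++]];
    }
    return $out;
}

$ops = word_diff(
    explode(' ', 'The lion ate the cat'),
    explode(' ', 'The wolf ate the cat')
);
```

For the lion/wolf pair this marks "lion" as removed and "wolf" as added and keeps the other four words. The table needs memory proportional to the product of the two word counts, so it is only a starting point for long texts.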

See Imre Simon, Sequence Comparison: Some Theory and Some Practice (1988) [citeseer.nj.nec.com]

There's a wiki page on Diff Algorithm [c2.com]

If you want, you can look at the source code for GNU diff [cvs.sourceforge.net]

Tom

ergophobe

9:55 pm on Jan 31, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Poking around, I found a Perl diff module that does pretty much what I'm looking for. There's also an HTMLDiff module that is pretty close to what amznVibe was looking for.

The documentation on Algorithm::Diff [search.cpan.org] includes all the source code [search.cpan.org]. Now if only I knew Perl... time to learn.

Tom

amznVibe

4:25 am on Feb 1, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I knew if I asked here I would get some helpful results :) I don't know Perl fluently but enough to hack at it and learn. Thanks for the starting point!

ergophobe

4:37 pm on Feb 1, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



This is a low priority for me, but I played around with the Perl module a bit yesterday. It works great, but unfortunately it prints its output as a stream instead of returning values, which would be a lot more useful.

It currently prints output like:

This
word
is
<span class="B">already</span>
<span class="A">no</span>
<span class="A">longer</span>
in
the
string.

I think I'll need to send the output to a file (unless Perl has a method for capturing standard output) and then parse it.
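If the Perl script is called from PHP, shell_exec() returns whatever the command writes to standard output, so a temporary file may not be needed. The sketch below parses output shaped like the sample above. The function name parse_diff_lines and the script name worddiff.pl are made up, and the thread does not say which of the classes "A" and "B" is the old text, so the parser just records the letter.

```php
<?php
// Hypothetical parser for output shaped like the sample above: one word per
// line, changed words wrapped in <span class="A"> or <span class="B">.
function parse_diff_lines(string $output): array
{
    $result = [];
    foreach (preg_split('/\R/', trim($output)) as $line) {
        $line = trim($line);
        if (preg_match('#^<span class="([AB])">(.*)</span>$#', $line, $m)) {
            // Changed word: remember which class it was tagged with.
            $result[] = ['side' => $m[1], 'word' => $m[2]];
        } elseif ($line !== '') {
            // Unchanged word.
            $result[] = ['side' => null, 'word' => $line];
        }
    }
    return $result;
}

// Assumed usage; worddiff.pl is a made-up script name:
// $parsed = parse_diff_lines((string) shell_exec('perl worddiff.pl'));
```

Because each word sits on its own line and the span wrapper is matched against the whole line, an angle bracket inside an unchanged word does not confuse the parser.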

It might take a while though - yesterday I still didn't know what the assignment operator was in Perl, so I'm starting from -273 Celsius ;-)

Tom