Forum Moderators: coopster

Message Too Old, No Replies

Archiving versions of database entries and diff

thoughts and tricks?

         

ergophobe

5:30 pm on Jul 1, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I'm looking to change a database application so that when users make changes, a certain number of versions get archived so that users could potentially see previous versions and, preferably, get some diff output.

The goal is really to version one large text field, though I would probably archive the entire record from that table.

I'm thinking the simplest way to do is to
- create a new table for versions with all the same fields as the current table, with another field for a version number and a timestamp.
- archive the entire record before updating it with the new data.
- run diff as needed when users want to see older versions.
- run a chron job to check how many versions there are and what dates they have, and delete any versions above #? and older than X months (days, whatever).

Since the data in the text field is responsible for about 90% of the 20MB in the DB, I'm a little afraid that if I end with ten versions of each record, I might end up with a huge database. That's not necessarily a problem as this is hosted on a 250+ Linux super-computer cluster, so the server has the juice to handle anything I want, but I think it may just become cumbersome.

It also occurred to me that I could reverse the process, and just version the diff output like CVS does, and then, if someone needs to, I could reconstruct versions based on the diff output.

Advantage: much less space and it would be easy to simply not archive versions where only whitespace is changed.

Disadvantage: fair bit more programming and much harder to reconstruct something from multiple versions ago.

Anyway, I wonder whether any of you have done any similar sort of versioning.

physics

11:43 pm on Jul 1, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



One option would be to only store the current version in the database.
Then set up a CVS repository and just use CVS to do the version tracking. So every time someone makes a change you just save the new version as a file and cvs commit. Then when they want to see different versions, etc, just pull them using CVS.

ergophobe

2:20 pm on Jul 2, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Hmmm... the thought briefly crossed my mine. It has some notable advantages (it's already built specifically for versioning). I'd have to learn a bit more about the internals of CVS and about accessing CVS via PHP. I suppose if I were going to that trouble, I might as well go the Subversion route.

I'm thinking I might be too lazy for this approach... I need to think about that.

coopster

10:39 pm on Jul 2, 2007 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



Excellent question, quite the dilemma. I wonder how manual writer's maintain version differences? Do they keep whole copy? Or somehow track differences as you stated? I'm guessing whole copy.

Is there any reason you wouldn't just add the version number to the PRIMARY KEY in the table rather than create multiple version tables?

ergophobe

5:14 pm on Jul 3, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month




Is there any reason you wouldn't just add the version number to the PRIMARY KEY in the table rather than create multiple version tables?

Good question. I suppose I could. I guess the reason I don't is that separating it out
- simplifies the query-building code
- avoids having to run through 10x as much data (assuming saving last ten versions) when that extra 90% is only backup data.
- simplifies the backup process (which consists of a cron job that runs daily and basically dumps the table in question, zips the dump and emails it to an offsite server).

Users are commonly running searches that are either partially or totally unable to take advantage of indexing (i.e. multi-field combos and I can't index every conceivable combo) and almost always with wildcards (at the very least LIKE '%string%', but sometimes more complicated regex). The searches can return hundreds of results, all of which have text fields (so no known offsets in the table file). That's already a fairly inefficient search and it can already take some time to do these searches on the 20MB that I have, especially if the very large text field is included in the search. If I keep lots of versions, that seems like it would slow things down.

It also means that for query-building and backups, I would then have to filter to return only the most recent version of each record. If they're all in the same table, that's extra programming logic. Thankfully the queries are all built in one class member function for each type of query, but still that means that every matching record has to be checked to find out which one is the most recent...

Seems like more trouble than benefit.


Manual writers

I think most of the time they save complete versions. Since they are likely working in applications with proprietary file formats (PageMaker, InDesign, possibly Word), normal versioning software (CVS, Subversion), don't really work. That's a problem we face and have never really been able to solve with the published component of the project (which is the main part of the project, and the DB is to support that).

Some software has versioning built in or as part of the general suite (Adobe VersionCue for example), but we have not been able to find anything that would do for InDesign or Word documents what CVS does for ASCII text, namely merge concurrent changes.

VersionCue does run as a server and Microsoft has something called MeetingSpace or some such thing, but all you can do is check versions in and out, without any concurrent versioning.

Without concurrent versioning that allows changes to be merged into the repository copy, you pretty much have to just save complete versions. Plus of course, you have the problem that if I change a style definition in a version, that changes the look of the whole document, which in this context actually matters.

physics

7:58 pm on Jul 3, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member




I'm thinking I might be too lazy for this approach... I need to think about that.

On the contrary I think it's the lazy way to go ;) The subversion / cvs groups have already done the hard work for you. And as for accessing svn via PHP it shouldn't be hard, just use the system() command and run the svn commands. All of the output is plain text so you can parse it as you like with regexes too.
You also might find the web-based cvs/svn browsers useful.

Yet another option would be to look in the code of and/or borrow from programs such as MediaWiki that do version control.

ergophobe

8:37 pm on Jul 3, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month




borrow from programs such as MediaWiki that do version control

I looked through Drupal. It just makes a full copy. I'll have to check out MediaWiki and see how it does it. Drupal is pretty rudimentary.

As for hooking into Subversion being easier, the thing is that every time a record in this table is updated, it uses the same class function, which means that by adding three lines, my rudimentary, drupal-style archiving is done... kind of limited though.

ergophobe

9:05 pm on Jul 3, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Keep a topic going long enough and something has to jump out.

The idea of looking at MediaWiki is exactly the sort of thing I was after. That pretty much provides me with everything I need. Basically, Mediawiki approaches it sort of like my approach with a new table and diff after the fact, but with some differences. Note: though they use a subversion browser for the documentation, they do not track versions that way, perhaps for some of the reasons I mentioned.

Basically they do it like this with four tables

- page: this is the core table that holds all information about a page *except* the main, very large content field. It has a field "page_latest" that tracks the id in the text table

- text: nothing but the PK, a text field, and a "flags" field for comma-separated flags like "gzip" or "utf8".

- revision: basically info about the deletion date and such, but like the page table, the main content is in the text table.

- archive: this is for deleted pages such that there is still a record of the page, but it is no longer part of the wiki unless undeleted.

Finally, DifferenceEngine.php lets you compare revisions.

I think that's more complicated than I need, though breaking out the "text" does mean that "page" table could use all fixed-length records (MediaWiki does not though) for faster retrieval.

Overall, I think that looks like a pretty good system and fairly easy to implement. If I can get the diff engine to ignore changes in whitespace, that will pretty much be the Holy Grail as far as this project is concerned.

Thanks.

ergophobe

5:10 pm on Jul 4, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Last update.

It is really easy to take the MediaWiki DifferenceEngine class library and plug it into an existing app. Voila! diff output, formatted more or less how you want.