Forum Moderators: coopster
The goal is really to version one large text field, though I would probably archive the entire record from that table.
I'm thinking the simplest way to do is to
- create a new table for versions with all the same fields as the current table, with another field for a version number and a timestamp.
- archive the entire record before updating it with the new data.
- run diff as needed when users want to see older versions.
- run a chron job to check how many versions there are and what dates they have, and delete any versions above #? and older than X months (days, whatever).
Since the data in the text field is responsible for about 90% of the 20MB in the DB, I'm a little afraid that if I end with ten versions of each record, I might end up with a huge database. That's not necessarily a problem as this is hosted on a 250+ Linux super-computer cluster, so the server has the juice to handle anything I want, but I think it may just become cumbersome.
It also occurred to me that I could reverse the process, and just version the diff output like CVS does, and then, if someone needs to, I could reconstruct versions based on the diff output.
Advantage: much less space and it would be easy to simply not archive versions where only whitespace is changed.
Disadvantage: fair bit more programming and much harder to reconstruct something from multiple versions ago.
Anyway, I wonder whether any of you have done any similar sort of versioning.
I'm thinking I might be too lazy for this approach... I need to think about that.
Is there any reason you wouldn't just add the version number to the PRIMARY KEY in the table rather than create multiple version tables?
Is there any reason you wouldn't just add the version number to the PRIMARY KEY in the table rather than create multiple version tables?
Good question. I suppose I could. I guess the reason I don't is that separating it out
- simplifies the query-building code
- avoids having to run through 10x as much data (assuming saving last ten versions) when that extra 90% is only backup data.
- simplifies the backup process (which consists of a cron job that runs daily and basically dumps the table in question, zips the dump and emails it to an offsite server).
Users are commonly running searches that are either partially or totally unable to take advantage of indexing (i.e. multi-field combos and I can't index every conceivable combo) and almost always with wildcards (at the very least LIKE '%string%', but sometimes more complicated regex). The searches can return hundreds of results, all of which have text fields (so no known offsets in the table file). That's already a fairly inefficient search and it can already take some time to do these searches on the 20MB that I have, especially if the very large text field is included in the search. If I keep lots of versions, that seems like it would slow things down.
It also means that for query-building and backups, I would then have to filter to return only the most recent version of each record. If they're all in the same table, that's extra programming logic. Thankfully the queries are all built in one class member function for each type of query, but still that means that every matching record has to be checked to find out which one is the most recent...
Seems like more trouble than benefit.
Manual writers
I think most of the time they save complete versions. Since they are likely working in applications with proprietary file formats (PageMaker, InDesign, possibly Word), normal versioning software (CVS, Subversion), don't really work. That's a problem we face and have never really been able to solve with the published component of the project (which is the main part of the project, and the DB is to support that).
Some software has versioning built in or as part of the general suite (Adobe VersionCue for example), but we have not been able to find anything that would do for InDesign or Word documents what CVS does for ASCII text, namely merge concurrent changes.
VersionCue does run as a server and Microsoft has something called MeetingSpace or some such thing, but all you can do is check versions in and out, without any concurrent versioning.
Without concurrent versioning that allows changes to be merged into the repository copy, you pretty much have to just save complete versions. Plus of course, you have the problem that if I change a style definition in a version, that changes the look of the whole document, which in this context actually matters.
I'm thinking I might be too lazy for this approach... I need to think about that.
Yet another option would be to look in the code of and/or borrow from programs such as MediaWiki that do version control.
borrow from programs such as MediaWiki that do version control
I looked through Drupal. It just makes a full copy. I'll have to check out MediaWiki and see how it does it. Drupal is pretty rudimentary.
As for hooking into Subversion being easier, the thing is that every time a record in this table is updated, it uses the same class function, which means that by adding three lines, my rudimentary, drupal-style archiving is done... kind of limited though.
The idea of looking at MediaWiki is exactly the sort of thing I was after. That pretty much provides me with everything I need. Basically, Mediawiki approaches it sort of like my approach with a new table and diff after the fact, but with some differences. Note: though they use a subversion browser for the documentation, they do not track versions that way, perhaps for some of the reasons I mentioned.
Basically they do it like this with four tables
- page: this is the core table that holds all information about a page *except* the main, very large content field. It has a field "page_latest" that tracks the id in the text table
- text: nothing but the PK, a text field, and a "flags" field for comma-separated flags like "gzip" or "utf8".
- revision: basically info about the deletion date and such, but like the page table, the main content is in the text table.
- archive: this is for deleted pages such that there is still a record of the page, but it is no longer part of the wiki unless undeleted.
Finally, DifferenceEngine.php lets you compare revisions.
I think that's more complicated than I need, though breaking out the "text" does mean that "page" table could use all fixed-length records (MediaWiki does not though) for faster retrieval.
Overall, I think that looks like a pretty good system and fairly easy to implement. If I can get the diff engine to ignore changes in whitespace, that will pretty much be the Holy Grail as far as this project is concerned.
Thanks.