Forum OpenACS Development: Upgrade and install management

Collapse
Posted by Malte Sussdorff on
I have a couple of issues with the current install behaviour but wanted to collect feedback first before changing something.

a) If I load a db script for an upgrade and it fails, the changes made in the script so far are not rolled back. Shouldn't we not make this either a transaction by default or have the best practice to wrap all upgrade.sql files in a statement?

b) If there is a problem with the upgrade.sql files (see above), why not change our default behavior to do upgrades not in the SQL upgrade but within a special "sql_upgrade" callback, which allows us to source DM changes with db_transaction, therefore not loading a file but executing dml commands from TCL

c) If a callback fails all the other upgrade callbacks are still executed. shouldn't we prevent that? Preferably shouldn't we roll back the whole upgrade (with all callbacks) in an attempt to prevent us leaving the system in a broken state?

d) If it is not possible to undo all, but just the last one, wouldn't it make sense to increase the version number with each single step, effectively changing how we do upgrades? So instead of loading all upgrade.sql files in one go, we take them one step at a time, look for a corresponding callback and then execute them. If there is no error, then increase the version number to the max of both, otherwise abort. This will at least allow us to fix an error upon upgrade and run the upgrade scripts again, especially the callbacks.

e) If the after_upgrade callbacks fail, the package is still enabled. This is okay in that it describes correctly what the callback does and when it is executed. Sadly the granularity fails here, so if you upgrade from XoWIKI 0.48 to 0.60, all of them will be executed after 0.60 has been enabled, which might not be the case as the enabling requires the after_upgrade callbacks to be successful.

f) I am missing an "after DB, before enabling" callback.

I think the majority of my problems come from the fact that upgrades are not treated as a granular thing but we do them in one block. This was not an issue when we only had DB scripts to be concerned about, but with the apm callbacks this can lead to a lot of unexpected behavior. Which is fine if you have a staging server and have the habit of first running the upgrade on staging with the latest production data and then on production, but I'm not sure all users are doing that. Apart from the hassle to reload a backup of your DB 😊.

Collapse
Posted by Dave Bauer on
Shouldn't this be fixed by testing your upgrade scripts?

That is, we shouldn't be checking in and releasing upgrades that don't work.

What does "all of them" refer to in this statement?
e) If the after_upgrade callbacks fail, the package is still enabled. This is okay in that it describes correctly what the callback does and when it is executed. Sadly the granularity fails here, so if you upgrade from XoWIKI 0.48 to 0.60, all of them will be executed after 0.60 has been enabled, which might not be the case as the enabling requires the after_upgrade callbacks to be successful.

I'll come back later and read your suggestions, I think most of them make sense, but I still think if any of the upgrades fail, your system is in an indeterminate state and you need to fix the problem, and go all the way back to your backup and start over. There's no way the system can recover from a partial upgrade.

You are ALWAYS supposed to backup before doing an upgrade. If you don't we can't be responsible. Its clear that especially if you have any custom code, you need to be careful with upgrades since its impossible to predict what will happen if you run an upgrade with a modified system.

So even if you don't have a "staging" server you should be backing up your system, making it unavaialable for login (I usually start it up on a different port) then run the upgrades, and then make the system available again.

Collapse
Posted by Don Baccus on
b) If there is a problem with the upgrade.sql files (see above), why not change our default behavior to do upgrades not in the SQL upgrade but within a special "sql_upgrade" callback, which allows us to source DM changes with db_transaction, therefore not loading a file but executing dml commands from TCL
Why a special callback? I already do my upgrade scripts in Tcl as much as possible, using the existing upgrade callbacks, wrapping the DML in a transaction.

This has the advantage that given today's mature versions of PG and Oracle (i.e. better SQL 92 support in both), odds are much higher than in the past that one can write an upgrade script using DML in Tcl that works fine for both DBs.

While writing SQL upgrade scripts requires you write one for PG and one for Oracle, even if they're the same, if you follow our standard upgrade conventions.

Collapse
Posted by Don Baccus on
d) If it is not possible to undo all, but just the last one, wouldn't it make sense to increase the version number with each single step, effectively changing how we do upgrades?
No, upgrades should be all-or-none.

Remember, your tcl scripts and www pages for version x.x.N+10 are unlikely to work if only x.x.N+1 gets run without error, regardless as to what number(s) you end up putting in the apm_package_versions table.

So I don't see the point behind this.

Collapse
Posted by Malte Sussdorff on
The point behind this would be that I could fix my upgrade script and rerun the upgrade from version x.x.N+1 without having to make sure I do not run scripts which have been successful so far. But probably too much effort, especially as the occasional end user won't be able to fix the upgrade scripts himself.
Collapse
Posted by Don Baccus on
Which is fine if you have a staging server and have the habit of first running the upgrade on staging with the latest production data and then on production, but I'm not sure all users are doing that. Apart from the hassle to reload a backup of your DB
I'll echo dave on this one.

If, for instance, you're upgrading PG without making a dump of the database, I hope I'm not your customer. Why should an upgrade of OpenACS be any different? For any mission-critical software, you should plan for upgrades to fail, and be happy if they don't. That's the nature of the beast.

Collapse
Posted by Malte Sussdorff on
I totally agree with what is proper system setups and testing. All our customers are on a three server setup, usually in three different locations (geographically): Development, Staging and Production.

The main reason for me to be blunt here is that I want to provide a best practice guide on how to do development and use the Package Manager to run upgrades.

Additionally, even in a development server I am usually "ill mooded" if I have to import the database again because an upgrade failed.

Last but not least, keep in mind that upgrades can fail because version 5.2.1 - 5.2.2 makes an assumption that is not true anymore in 5.4.1, but breaks because the TCL procedures used in the 5.2.1-5.2.2 upgrade behave differently now. This happened to me on acs-subsite and a couple of times on XoWIKI. And yes, you should not have a system so out of date, but it still happens (and apparently it happens a lot, from my experience).

This all being said, I think one of the best things we could do is get rid of the /sql/postgresql/upgrade files (for the future) and use the TCL callback to do the upgrade. This would prevent us from having side effects and does allow transactions

Why have one additional callback? Probably no need, especially if we are not having DB scripts to upgrade at the same time, but "before_upgrade" does not sound right and "after upgrade" does not sound right either.

Collapse
Posted by Dave Bauer on
If you want to make sure upgrades work across many versions please test them BEFORE we release. We rely on the community to test things obviously.

ie: "Last but not least, keep in mind that upgrades can fail because version 5.2.1 - 5.2.2 makes an assumption that is not true anymore in 5.4.1, but breaks because the TCL procedures used in the 5.2.1-5.2.2 upgrade behave differently now. This happened to me on acs-subsite and a couple of times on XoWIKI. And yes, you should not have a system so out of date, but it still happens (and apparently it happens a lot, from my experience)."

This can be fixed by updating the upgrade 5.2.1-5.2.2 in 5.4.1 only. That makes the upgrade work fine, but doesn't break upgrades in other releases.

There is no way to test every permutation of versions.

You must backup. That IS the best practice. I still say if an upgrade fails, there is no way to reliably continue upgrades if one script fails. The database is in an unknown state. The only way to proceed is to ttart over, fixing the error. Is the user cannot fix it himself, they report the problem and ask for help, hopefully the upgrade file is fixed in the next release.

Collapse
Posted by Malte Sussdorff on
Dave, to be honest, the easiest thing for me was to remove the offending upgrade from both the callbacks (as is the case with XoWIKI) or the upgrade directory. And whenever I stumble upon such a thing I silently commit to OpenACS on the latest release (as all future releases obviously would have the same problem).

Also note that we run into the same problem whenever we have dependencies on other packages (e.g. a change of PL/SQL function in acs-content-repository which the old upgrade script in acs-subsite does not know about).

Either way, I take from this discussion the following guidelines for the developers:

a) Use TCL callbacks to upgrade whenever possible, do not use the SQL upgrade directories
b) Wrap your TCL callbacks in a db_transaction (and yes I know you are saying this does not help when it fails, but at least I do not have to fix it in the database if all is rolled back).
c) Make sure you require the latest version of a certain package you depend on and check your upgrade scripts at least one major release back if it still works (read: are the procedures still there and is the DM still the same)

For the end users:
d) Make sure you upgrade every major release
e) Make sure to backup before running the upgrade
f) Do not upgrade from the acs-repository (as it fails for too many users it seems).

Collapse
Posted by Malte Sussdorff on
It might (and I say "might" here with major emphasis) be a good idea to require for a release to check at least two major releases back if the upgrade script is still needed. Examples would be

- PL/SQL definitions (if superseeded later)
- View definitions (if superseeded later)
- Creation of later dropped content types
- Creation of anything if it was dependent on a certain version of a package and that has changed in the depending upon package.

Collapse
Posted by Dave Bauer on
Hmm, I don't recall having these type of upgrade issues in general. Now I have found upgrade bugs before, usually I find it in upgrading openacs.org (which is a could of versions behind now anyway...)

In reality it makes sense to try to upgrade from an older version. Of course you really need a populated system to make the test effective. We need people to test this if its important to them.