Forum OpenACS Development: message catalog file format, syncing of message catalog files with database

I would like to collect some ideas on the format of message catalog files and how they should be loaded and synced with the database. Currently, the format of the catalog files is:

_mr en_US dotlrn.my_portal_pretty_name My dotLRN Portal

Where _mr is the message register TCL proc that inserts the message into the database, or updates (overwrites) the message if it is already there. Each catalog file is evaluated in its entirety as TCL code.

My idea was to move to an XML format along the lines of:

<message_catalog>
   <msg key="dotlrn.my_portal_pretty_name">My dotLRN Portal</msg>
   ...
</message_catalog>

Notice that I am not providing the locale or charset as we have adopted the convention of having one catalog file per locale. The locale and charset information is in the filename.

The XML format makes it easier to parse the contents of a catalog file and loop over the messages without necessarily registering them.
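To illustrate the point about looping over messages without registering them, here is a minimal sketch in Python (not the acs-lang implementation, which is TCL). The function name iter_messages and the sample catalog string are made up for the example; only the element and attribute names come from the proposal above.

```python
import xml.etree.ElementTree as ET

# Sample catalog in the proposed format (one file per locale, so no locale attribute here).
CATALOG_XML = """<message_catalog>
   <msg key="dotlrn.my_portal_pretty_name">My dotLRN Portal</msg>
</message_catalog>"""


def iter_messages(xml_text):
    """Yield (key, text) for every <msg> element; nothing is written to a database."""
    root = ET.fromstring(xml_text)
    for msg in root.findall("msg"):
        yield msg.get("key"), (msg.text or "")


if __name__ == "__main__":
    for key, text in iter_messages(CATALOG_XML):
        print(key, "=>", text)
```

With the current TCL format the same loop would require either evaluating the file (and thus calling _mr) or pattern matching on the source text.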

Perhaps the trickiest issue we are dealing with is syncing those files with the database. Suppose you have translators using the web UI to update the message catalog in the database. However, you might also want to import catalog files with translations done by other parties. What should we do when we import a message from a file that differs from what is in the database? My idea was to back up the old database value somewhere in the file system and overwrite it with what's in the message file (the message file takes precedence). We would then also inform the admin about this (produce a message import report). There are many other alternatives here though. Thoughts?
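A rough sketch of the "file takes precedence, keep the old value, report it" idea, in Python for brevity. Everything here is an assumption for illustration: the database is modelled as a plain dict, import_messages is a hypothetical name, and writing the backups to the file system is left to the caller.

```python
def import_messages(db, file_messages):
    """Merge file_messages into db (both dicts mapping key -> text); the file wins.

    Returns (backups, report):
      backups - old database values that were overwritten, keyed by message key
      report  - human-readable lines for an import report to the admin
    """
    backups = {}
    report = []
    for key, new_text in file_messages.items():
        old_text = db.get(key)
        if old_text is None:
            report.append(f"added {key}")
        elif old_text != new_text:
            backups[key] = old_text  # keep the overwritten value so it can be saved elsewhere
            report.append(f"overwrote {key}: {old_text!r} -> {new_text!r}")
        db[key] = new_text
    return backups, report
```

Messages that are identical in the file and in the database produce no report line, so the report only lists real changes.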

Another issue is that when we are extracting message keys and messages from adp templates with our script, I want to know if the message key already exists with a different value, in which case I make the key unique before insertion. I currently make this check with the message files only (ignoring what's in the database). Unfortunately I do this with a fragile regexp on the TCL code in the catalog files. With an XML format this would be easier. On the other hand, one could argue that we should be checking against and updating the database here (ignoring what's in the file system). Then at a later stage the contents of the database can be dumped to the catalog files. Is this a better way to go?
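For what it's worth, the "make the key unique" step could look something like the Python sketch below. The numeric suffix scheme (_1, _2, ...) and the name unique_key are assumptions for the example, not what the extraction script actually does; the existing mapping could be filled from the catalog files or from the database, whichever ends up being the reference.

```python
def unique_key(existing, key, text):
    """Return a key under which text can be registered without clobbering another message.

    existing maps message key -> message text. The key is reused if it is
    unused or already maps to the same text; otherwise _1, _2, ... is appended
    until a free (or identical) slot is found.
    """
    if existing.get(key, text) == text:
        return key
    n = 1
    while True:
        candidate = f"{key}_{n}"
        if existing.get(candidate, text) == text:
            return candidate
        n += 1
```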

Thanks!

/Peter
Two thoughts.

- You've got two storage locations for your data, the database and the file system of XML files (great idea on the XML, btw). I suggest picking *one* of them as the master data source and the other as the slave, and always making sure that your use cases follow this principle. Otherwise, you're in for some serious conflict resolution all over the place. For example, you could pick the DB as your master, and the filesystem of XML as just a slave of the DB that gets updated regularly by just dumping out the DB contents. Then, if you wanted to import data into the DB, you would do that *explicitly* through an "import" UI, just so that your users know that the master copy is in the DB. That's my recommendation: pick a master and stick with it.

- Quick note on the XML format: I would recommend following the model of the XQL files, where both the filename and the XML indicate metadata. In the case of XQL, the filename actually matters little (except for tagging in the APM), since the Query Dispatcher is looking at the XML to decide which database those queries support. What you get here is a less error-prone system: you wouldn't want to "forget" which locale a message catalog is in just because the file was accidentally moved to the wrong filename.

I agree with Ben.

Just a quick note: Lars has been importing several emacs backup files into CVS lately. I e-mailed him yesterday about it, but he seems not to have gotten my mail, since more backup files were added today.

Thanks Ben and Roberto! I have now implemented the XML-based catalog file format, and it even has its own acs-automated-testing tests in acs-lang/tcl/test. What you are saying, Ben, makes a lot of sense, and that's basically the approach that we are taking. There is currently no conflict resolution when importing an XML file; the file simply takes precedence. I have moved to the more appropriate terminology of import/export of catalog files. When exporting, everything that is in the database is dumped, and any existing catalog file is backed up to catalog-filename.xml.org.

The catalog xml format is:

<message_catalog package_key="dotlrn" locale="en_US" charset="ISO-8859-1">
  <msg key="pretty_name"> dotLRN</msg>
  ...
</message_catalog>
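To make the export step concrete, here is a small Python sketch of "dump everything, back up any existing file to .org" that writes the format shown above. It is only an illustration under assumptions: export_catalog is a hypothetical name, the messages are passed in as a dict rather than read from the database, and the real acs-lang code is TCL.

```python
import os
import shutil
import xml.etree.ElementTree as ET


def export_catalog(path, package_key, locale, charset, messages):
    """Write messages (dict: key -> text) to path in the catalog XML format.

    An existing file at path is first copied to path + '.org' as a backup.
    """
    if os.path.exists(path):
        shutil.copyfile(path, path + ".org")
    root = ET.Element(
        "message_catalog", package_key=package_key, locale=locale, charset=charset
    )
    for key in sorted(messages):
        ET.SubElement(root, "msg", key=key).text = messages[key]
    # Encode the file in the charset that the root element declares.
    ET.ElementTree(root).write(path, encoding=charset, xml_declaration=True)
```

Since the locale and charset are now attributes of the root element, an importer can read them from the XML and does not have to trust the filename.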

About the tilde files I will take the full blame for that and it's been corrected. Before I got commit rights I was committing with Lars's account 😊

Are there any considerations concerning the fact that most messages of different locales that are in the same language will be equal, e.g. those in en_US and en_GB?

Would it make sense to consider this either in the data model or at least in the UI? I think it is mostly about the translation effort that could be saved if locales of the same language shared the same messages by default, so that only the exceptions from the common message would have to be entered separately for each locale.

This would get more important if there were languages with many different dialects (are there any?). How does other i18n software deal with this?

Tilmann, thanks for catching this language issue, as I think we might have missed it otherwise.

How about this approach: Typically a package is developed in en_US locale. Then, when Simon sets his preference to en_GB the message lookup will attempt that locale but fall back on en_US in most cases where they are the same. This means the en_GB message catalog file only contains entries for the difference to the en_US locale. Does this make sense? If so the next issue is how to best implement it...
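As a starting point for discussion, the fallback could be as simple as the following Python sketch. The in-memory CATALOGS dict, the sample keys, and the name lookup are all made up for the example; the real lookup would of course live in acs-lang.

```python
# Hypothetical in-memory catalogs: en_GB only holds the differences from en_US.
CATALOGS = {
    "en_US": {"dotlrn.color_label": "Color", "dotlrn.pretty_name": "dotLRN"},
    "en_GB": {"dotlrn.color_label": "Colour"},
}
DEFAULT_LOCALE = "en_US"  # assumed locale in which packages are developed


def lookup(key, locale):
    """Return the message for key in locale, falling back to DEFAULT_LOCALE."""
    for candidate in (locale, DEFAULT_LOCALE):
        text = CATALOGS.get(candidate, {}).get(key)
        if text is not None:
            return text
    raise KeyError(key)
```

Here lookup("dotlrn.pretty_name", "en_GB") misses in en_GB and is answered from en_US, while lookup("dotlrn.color_label", "en_GB") is answered directly.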

Peter, when we were looking at it, the issue we had with the sort
of fallback you are talking about was that, unless you were
careful about it, it could be a real performance issue
for any site whose locale is not a "primary" one (fr_CA, es_US etc.), where almost
all catalog lookups end up having to go through an extra step or
two to find their message.  It's probably not a big deal if it does not mean going to the DB though.

The only other issue is in terms of maintainability, the en_GB entry
where needed is easily overlooked when changing the en_US message.
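One possible way around both concerns, sketched in Python under the assumption that catalogs are cached in memory: merge the fallback into each locale once at load time so that every lookup is a single hash access, and flag overrides whose base message has changed. The function names are hypothetical.

```python
def build_effective_catalog(base, overrides):
    """Merge a locale's overrides on top of the base catalog, once, at load time."""
    merged = dict(base)
    merged.update(overrides)
    return merged


def overrides_to_review(changed_base_keys, overrides):
    """Keys changed in the base locale that also have an override in this locale."""
    return sorted(key for key in changed_base_keys if key in overrides)
```

The cost is memory (one full dict per locale) and the need to rebuild the merged catalog whenever either the base or the override catalog changes.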

Maybe it would make sense to store only the language instead of the full locale in the database if a message should be valid for all locales of that language, and to store the full locale if a message is only valid for one locale. For messages that have both a language-only and a full-locale entry available, the full-locale entry would then override the language-only one (if it matches the specific locale that you are querying for).

I am sure that if it is stored this way it would be possible to write a smart query that does: "select all message keys with language en but prefer en_US if available" for en_US in this example. You could even add a "and if a message is only available in en_GB then I want it too in my resultset".
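Here is a rough attempt at such a query, using Python's sqlite3 module so that it can be tried out standalone. The table and column names (messages, message_key, locale, message) are assumptions for the example and not the actual acs-lang data model. The SQL orders the candidates by preference (exact locale, then language-only, then other dialects of the same language), and the final "first row per key wins" step is done in Python to keep the SQL portable.

```python
import sqlite3


def messages_for(conn, locale):
    """Return {message_key: message} for locale, preferring the most specific entry."""
    lang = locale.split("_")[0]
    prefix = lang + "_"
    rows = conn.execute(
        """
        SELECT message_key, message
        FROM messages
        WHERE locale = :locale
           OR locale = :lang
           OR substr(locale, 1, length(:prefix)) = :prefix
        ORDER BY message_key,
                 CASE WHEN locale = :locale THEN 0   -- exact locale first
                      WHEN locale = :lang THEN 1     -- then the language-only entry
                      ELSE 2 END,                    -- then any other dialect
                 locale
        """,
        {"locale": locale, "lang": lang, "prefix": prefix},
    )
    result = {}
    for key, message in rows:
        result.setdefault(key, message)  # first (most preferred) row per key wins
    return result


def demo_connection():
    """Build an in-memory database with a few made-up sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE messages (message_key TEXT, locale TEXT, message TEXT)")
    conn.executemany(
        "INSERT INTO messages VALUES (?, ?, ?)",
        [
            ("dotlrn.color", "en", "Color"),
            ("dotlrn.color", "en_GB", "Colour"),
            ("dotlrn.name", "en", "dotLRN"),
            ("dotlrn.lift", "en_GB", "Lift"),
        ],
    )
    return conn
```

With the sample rows, a query for en_US gets "Color" from the language-only entry and still picks up "Lift", which only exists in en_GB.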

This way the initial translation could be generic for one language, and as soon as someone feels the need to localize it that can be done by adding the dialect variants.

The insight of someone who has experience with the way this is handled, e.g. in GNU gettext, would be valuable input here. Or you may want to look through http://i18n.kde.org/, which looks promising.

I think it is important that we avoid complexity where possible. I would like to second Tilmann's suggestion to look for input from others who might have implemented this in another context.