Forum OpenACS Development: i18n of content repository content

Posted by Dave Bauer on
Now that i18n and l10n of the user interface are solved for OpenACS, we need to address i18n of "content".

The content repository currently stores nls_language for cr_revisions, but it isn't really used by any existing applications.

I chatted with Don on IRC yesterday about possible solutions. Setting the language per item is simple: just put the language code in cr_revisions.nls_language.

What is more complex is relating items that are the same, except for the language.

We tentatively decided that creating a cr_child_rel between the "original" item and the translated items would be a good solution. Each translation would be stored as its own cr_item. This way it is straightforward to edit one translation, and the cr_child_rel also lets us find all translated versions of an item.
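
To make the proposal concrete, here is a minimal sketch in Python rather than the CR's own PL/SQL or Tcl. It is only an in-memory model of the idea: the `Item` and `Repository` classes, their fields, and the "translation" tag are hypothetical names chosen for illustration, not the real datamodel or API.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    """Stand-in for a cr_item; one item per language version (assumed shape)."""
    item_id: int
    nls_language: str
    title: str


@dataclass
class Repository:
    """Toy model: items plus child relations of the form (parent, child, tag)."""
    items: dict = field(default_factory=dict)
    child_rels: list = field(default_factory=list)

    def add_item(self, item):
        self.items[item.item_id] = item
        return item

    def add_translation(self, original_id, translation):
        # Store the translation as its own item, then relate it to the original.
        self.add_item(translation)
        self.child_rels.append((original_id, translation.item_id, "translation"))
        return translation

    def translations_of(self, original_id):
        # Follow the child relations to find every translated version.
        return {
            self.items[child].nls_language: self.items[child]
            for parent, child, tag in self.child_rels
            if parent == original_id and tag == "translation"
        }


repo = Repository()
repo.add_item(Item(1, "en_US", "Getting started"))
repo.add_translation(1, Item(2, "de_DE", "Erste Schritte"))
repo.add_translation(1, Item(3, "fr_FR", "Premiers pas"))
print(sorted(repo.translations_of(1)))
```

Each translation can be edited on its own because it is a separate item, while one pass over the relations finds all versions of the original.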

There may be other applications that use the content repository that also need to be addressed. Let us discuss the implications of this design, and how it will work for OpenACS.

Posted by Malte Sussdorff on
Two ideas:
  • If we store the nls_language in cr_revisions, we use the revisions for multiple languages. To determine which is the live revision, we either create a live_p in cr_revisions or create a cr_item_live_revision_map, which shows the live revision for each language. Obviously, the live_revision column in cr_items would be made obsolete and we would have to add a unique constraint on (nls_language, cr_item_id).
  • If we follow the cr_child_rel approach, the nls_language needs to go out of cr_revisions and into cr_items.
  • For the assessment system we look at various options, have a look at One core element is the overlaying of language as we assume that most content will not be translated. To avoid heavy joining we denormalized the localized tables, meaning all content to be localized would be stored within the table with a default_locale and only if the user does not support the default_locale we'd look up the localization table for additional text.
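
The overlay idea in the last bullet can be sketched roughly as follows. This is an illustrative Python model, not the assessment package's schema: the `BASE` and `LOCALIZED` dictionaries, the `title` column, and the fallback to the default text when no translation exists are all assumptions.

```python
# Base table: every row carries its text plus the locale that text is written in.
BASE = {
    1: {"default_locale": "en_US", "title": "Welcome"},
}

# Localization table: consulted only when the user's locale differs.
LOCALIZED = {
    (1, "de_DE"): {"title": "Willkommen"},
}


def localized_title(row_id, user_locale):
    base = BASE[row_id]
    if user_locale == base["default_locale"]:
        # Common (monolingual) case: no second lookup, no join.
        return base["title"]
    override = LOCALIZED.get((row_id, user_locale))
    # Assumption: fall back to the default text when nothing is translated.
    return override["title"] if override else base["title"]


print(localized_title(1, "en_US"))
print(localized_title(1, "de_DE"))
print(localized_title(1, "fr_FR"))
```

The point of the denormalization is that the default-locale path never touches the localization table.
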
Posted by Jun Yamog on

If we take the cr_child_rel approach, what would happen if we delete the parent item?  Does this mean we will delete the other translations as well?  How about using cr_item_rel instead, so we only lose the relation?

Another issue we came up with: what about non-content items / acs objects that are related to an i18n content item?

For example, we have an article that is translated into several articles.  Each article has a non-content item / acs object associated with it, say an author.  What would we do with the author?  Do we have a different entry in authors for each translated article, or do we have a single entry for an author and different i18n entries for the author's info?

Or do we make an author a content type/ object type?

Posted by Stan Kaufman on
Dave, thanks for beginning this thread. The problem of handling i18n content is a special case of handling complex version content generally. We're grappling with that in our design for the Assessment package.

I've written up the current state of our thinking here:

We want to leverage the CR in all of this, but we're not entirely clear how best to do that. This thread is an important one.

Posted by Jeff Davis on
Another thing to worry about: if you have a translated item, you want to track which revision of the document in the original language it's a translation of. It's an issue since you would like to flag any translated doc for updating if the original is updated.

I think it's better to have individual cr_items for each translation and relate them since then you can leverage the normal CMS workflow rather than building a second one for handling translation workflow.
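
A rough sketch of the "flag for updating" check, assuming each translation item records the source item and the source revision it was translated from. The field names `source_item_id` and `source_revision` are hypothetical; nothing like them is confirmed to exist in the CR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Item:
    item_id: int
    nls_language: str
    live_revision: int
    # Hypothetical bookkeeping: which item and revision this was translated from.
    source_item_id: Optional[int] = None
    source_revision: Optional[int] = None


def stale_translations(items):
    """Return ids of translations whose source has moved to a newer live revision."""
    by_id = {item.item_id: item for item in items}
    stale = []
    for item in items:
        if item.source_item_id is None:
            continue  # an original, not a translation
        source = by_id[item.source_item_id]
        if source.live_revision != item.source_revision:
            stale.append(item.item_id)
    return stale


original = Item(1, "en_US", live_revision=101)
german = Item(2, "de_DE", live_revision=201, source_item_id=1, source_revision=101)
french = Item(3, "fr_FR", live_revision=301, source_item_id=1, source_revision=101)
items = [original, german, french]

before = stale_translations(items)

# The original gets a new live revision; only the German translation catches up.
original.live_revision = 102
german.source_revision = 102
after = stale_translations(items)
print(before, after)
```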

Posted by Don Baccus on
I don't like the idea of storing the different translations in one item's set of content revisions for several reasons:

  • Malte's suggestion is a big change to the existing datamodel. Lots of code, including custom user code, would need to change. Lots of upgrade scripts would need to be written.

  • People already claim the CR's too complex. This would make it more complex, adding fuel to their argument.

  • I still want to investigate "non-versionable" content-items (item still linked to revision via live_revision and latest_revision, but sharing the same acs_object row rather than having two objects - the default would be today's versioned content item, so no existing client code would be broken). This would give us "lighter weight" content types for applications that don't need versioning, adding weight to the arguments of those of us who lobby for more usage of the CR.

  • We should remember that the vast majority of sites are monolingual, not multilingual. Trade-offs, when necessary, shouldn't make the monolingual case more complex unless it is absolutely impossible to avoid because far more people are affected by this than the multilingual case.

Posted by Malte Sussdorff on
  • I still want to investigate "non-versionable" content-items.

    If you think about non-versionable content-items, why should we still store information about the item in cr_revisions? If we do it for API's sake, then we just make the API smarter....

  • We should remember that the vast majority of sites are monolingual, not multilingual

    This is the primary reason behind my approach of overlaying translations (have a default version that is "overlaid" by a locale-specific translation). I agree that your master item approach is best in achieving this.

  • Malte's suggestion is a big change to the existing datamodel

    I agree that idea 1) is a big change, but I don't see how the move of nls_language to cr_items is a big change, especially taking into account that it is not used anywhere in the first place (if my knowledge is accurate). Otherwise I don't think the upgrade would be hell.
    My main thinking here is: Only keep information in cr_revisions that is bound to change or has to be displayed to the user.

Posted by Don Baccus on
Jun - the deletion of the original document question's a good one.  The reason Dave and I both thought about using cr_child_rel is because we both like the notion of denoting a "source document" and its translations.  In other words, we don't simply have a set of related documents but rather the original document is an authoritative one.  We felt like a translator would want to know which document is the authoritative source document, and translate from that (assuming they're fluent in the source language, no problem if you're hiring a translation service).  Translating from (say) English to Chinese and then to French seems more dodgy than simply translating from English to French, for instance ... more opportunity for error.

Of course not everyone will care about the "original" vs. "set of translations" relationship but we felt we should support it natively in whatever datamodel/API Dave develops.

As far as relating other objects ... I see that as outside the scope of the content repository.  If an application wants to relate a set of translators to each translation, it can use acs-rels to do so easily enough or its own private methodology.  The project Dave's working on may require this and if so that code should be available as it is all to be GPL'd, so perhaps that could be turned into a generally useful API for others.

But outside the CR, I would think.  Remember the CR's used for a lot of things other than documents.  Such features could lie in some sort of common CMS API as we get further down the CMS path, perhaps?

Posted by Jun Yamog on

I guess using a child relationship binds it more tightly, making clear that a piece of content was indeed translated from this source content.  So I will go with your and Dave's point of using child relationships.

I am not sure whether i18n of data related to content items belongs within the CR or not.  But I believe it should be within the CMS context.  For example, an article will be made of different parts, one of which is the author (name, title, etc.).  Naturally the author can have more than one article, so authors will have their own table.  The question is, when we pull up a German article, the other data about the author must be in German.  To top it off, a different set of people will be managing the articles and the authors.  Any hints on the best way to do it will be appreciated.  It may be possible that whatever is learned from this project will help OpenACS.

Your suggestion of non-versioned items is helpful.  Although we all know we have been trying to get this, I am not sure if we are prepared to undertake such a big task for this project.

Posted by Malte Sussdorff on
I gave some thought to a TCL API for the CR to do what I need. I'm explicitly leaving out some functionality the CR provides, which does not mean I don't want it included. I just limit myself to the things needed with regard to I18N.
  • Create new content (cr_item_create): This function will allow us to store new content in the CR, either as a new version, a new translation or a new revision. The API will be smart enough to detect the user's wishes based on the information given. Switches:
    • -content_type: Content type of the item we want to create. This also defines what additional switches need to be given. Optional if -item_id is given.
    • -item_id: If supplied the system assumes we want to store a new revision of the cr_item, otherwise it will create a new item.
    • -language: Defines the language of the item, defaults to the system language. If item_id is given as well, check if the language stored with the item_id matches. If yes, assume it is a new revision, if not, create a new item with the parent_id "item_id" (if done via rels or any other method should not concern the API).
    • -xxx: All the switches necessary e.g. for the category system.
    • -yyyy: All the switches necessary to fill the attributes of the content_type. If a non-optional value is missing, throw an error.
  • Change live version: This function will define the live version. It only has to be called if the live version differs from the latest version (and we have versioned items in the first place).
    • -item_id: Item we want to change the live version for.
    • -revision_id: Revision of the "to be live" version.
    • -language: If given, this changes the "master" version. Useful if you realize that the Dutch version changes more often than the "master" French version and you want to display the Dutch version on your page by default instead.
  • Retrieve content: This function will retrieve the content of an item and put it automatically into variables in the caller's context. As an additional benefit the variable cr_attributes in the caller's context will contain a list of the variables set.
    • -item_id
    • -language: Language version of the item_id. If the item_id is not the master item_id for translations and the language does not match the item_id, try to find the corresponding item_id, by going up one level to the master_item and look from there for the language. (optional)
    • -revision_id: Always return the live_revision unless the revision_id is given. In this case just return this revision. (optional)
    • -attributes: List of attributes that should be set as variables in the caller's context. If {all} is given, return all.

Okay, this is a very rough sketch; maybe it helps, maybe it doesn't, and hopefully I just missed something and it already exists :-).
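
The -language lookup described for "Retrieve content" could work roughly like this. It is a Python sketch of the rule only (go up one level to the master item, then look among its translations); the `ITEMS` table, its `master_id` field, and the `resolve_item` name are hypothetical, and returning `None` when no version exists is an assumption.

```python
# item_id -> language and master item (None means this item is the master).
ITEMS = {
    10: {"language": "fr", "master_id": None},
    11: {"language": "nl", "master_id": 10},
    12: {"language": "de", "master_id": 10},
}


def resolve_item(item_id, language=None):
    """Find the item holding the requested language version, if any."""
    item = ITEMS[item_id]
    if language is None or item["language"] == language:
        return item_id
    # Go up one level to the master item (or stay put if we are the master).
    master_id = item["master_id"] if item["master_id"] is not None else item_id
    if ITEMS[master_id]["language"] == language:
        return master_id
    # Look from the master for a translation in the requested language.
    for candidate_id, candidate in ITEMS.items():
        if candidate["master_id"] == master_id and candidate["language"] == language:
            return candidate_id
    return None  # assumption: the caller decides what to do when nothing matches


print(resolve_item(11, "de"))
print(resolve_item(12, "fr"))
print(resolve_item(10, "en"))
```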
Posted by Dave Bauer on
There is a small problem with this:

-item_id: If supplied the system assumes we want to store a new revision of the cr_item, otherwise it will create a new item.

This doesn't work if you predefine the item_id, such as pre-generated keys for double-click protection. Also the pl/sql api supports supplying an item_id without creating a new revision. The pl/sql api only creates a new revision if title or content is specified, and I think as much as possible the tcl api should function similarly to the pl/sql api. It's not absolutely necessary, but it is the design with the least surprise.

Right now the BCMS tcl api has a tcl procedure to create a new item, and another to create a new revision. Once you have the item created, there is no need to call the procedure to create a new item.

I don't think the tcl API really needs to guess for you if your content is a translation. Clearly there would need to be a user interface to specify a translation, and the form processing should be able to handle that case with the information provided by the user. For example in the case of using a cr_child_rel to relate translations:

content_item::new <params>
content_revision::new <params>

to create the initial item, then

content_item::new <params>
content_revision::new <params>
content_item::child_rel::new item_id_one item_id_two "translation"

Or something along those lines would make it pretty clear what is happening.

There is a procedure to change the live version, and more than one procedure to get content already :) We do want to make the procedures more consistent, right now they are not.

Posted by Jun Yamog on
I agree with Dave, smaller api is a bit better.  I also think there is no problem with Malte's concern.

There were times that I created a proc that called several bcms procs to make things easier.  For example... jun::create_page.  That makes a new item, a new version and a predefined content if no content is passed; it also sets the first revision to live.

Posted by Don Baccus on
-item_id: If supplied the system assumes we want to store a new revision of the cr_item, otherwise it will create a new item.

This doesn't work if you predefine the item_id, such as pre-generated keys for double-click protection. Also the pl/sql api supports supplying an item_id without creating a new revision. The pl/sql api only creates a new revision if title or content is specified, and I think as much as possible the tcl api should function similarly to the pl/sql api. It's not absolutely necessary, but it is the design with the least surprise.

This is the approach I took with GP and it worked very well. Frankly I think the approach taken by the PL/SQL code to be extremely arcane in this case.
Posted by Don Baccus on
Actually I take that back - the routines I wrote for GP take an explicit -new flag (pass in ad_form_new_p) and allow for passing in a pre-allocated key...
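
The difference between guessing from item_id and passing an explicit "new" flag can be sketched like this. It is a Python illustration of the pattern only, not the GP routines: `ItemStore`, `allocate_id`, and `save` are hypothetical names, and silently ignoring a repeated create is an assumption about how double-click protection might behave.

```python
class ItemStore:
    """Toy store: item_id -> list of revision contents."""

    def __init__(self):
        self._next_id = 1
        self.revisions = {}

    def allocate_id(self):
        # Key handed out when the form is rendered, before anything is stored.
        item_id = self._next_id
        self._next_id += 1
        return item_id

    def save(self, item_id, content, new_p):
        if new_p:
            if item_id in self.revisions:
                # Same pre-allocated key submitted twice (a double click): ignore.
                return False
            self.revisions[item_id] = [content]
            return True
        # Not new: the caller explicitly asks for another revision.
        self.revisions[item_id].append(content)
        return True


store = ItemStore()
key = store.allocate_id()
first = store.save(key, "draft", new_p=True)
repeat = store.save(key, "draft", new_p=True)
store.save(key, "edited draft", new_p=False)
print(first, repeat, store.revisions[key])
```

Because the caller states its intent, a pre-allocated item_id no longer implies "create a new revision".
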
Posted by Dave Bauer on
See also this discussion on i18n of content:
Posted by Matthias Melcher on
From a non-techie point of view, I think it is more
important to relate a translation to the content item
than to its versions, since translators will probably
not want to translate each version again but rather
adapt the changes incrementally (unless they do it
automatically, which guarantees poor quality, or have
it done by companies paid for line counts, which
guarantees high cost).

The distinction made above between different translation
paths ("localization" vs. "translation") will, IMO, be more
important to the administrators of new versions and
translations than to the reader, and the terminology
is confusing because it is incorrect: localization is not the
end product of a translation process but may be more,
including, for instance, replacing a red cross by a red
(or green?) half-moon.

I am surprised how much effort is invested in determining
automated ways to have dotLRN present a localized page
rather than (a) let the user's browser settings negotiate
the language (does AOLserver support Apache's MultiViews
method at all?), and (b) let the user decide explicitly.
Especially when current quality translations are not
available, the locale setting selected for the UI need not
at all be the optimal choice for reading complicated texts.

Posted by Dave Bauer on

Good points. The knowledge of which language a translation was translated from is not interesting except possibly to an administrator.

Users should be presented with content in their chosen locale, and optionally a list of links to the content in other languages.

Posted by Malte Sussdorff on
The knowledge of which language a translation was translated from is not interesting except possibly to an administrator.

I disagree. It is important to the user, especially if he does not like the translation, to see what language the content was derived from. This way you can always have a link stating "view the original in language xyz". Furthermore, if the translation is plainly wrong (e.g. documentation), it might help the user to know that this documentation is a translation and *not* the original, therefore having the option to fall back on the original.

It might be my personal preference, but this is why I tend to read documents in the original language as long as I can understand it well enough. Having read too many German translations and watched too many dubbed movies I think translations are a primary source for misunderstandings and therefore should be clearly marked as such.

Posted by Matthias Melcher on
If we are talking about translations AS content rather than
translation OF content, you are certainly right.

In a class dealing with the critical edition of some French
thinker, for instance, the user should not be automatically
directed to, say, the English translation because his UI
locale is German and, because of the missing German
translation, the default of English becomes active.
Instead, they should be offered all translation items with
all available genesis information.

In contrast, the translations of educational texts of some
arbitrary knowledge domain, should not be allowed to be
so bad that the source needs to be consulted, and some
content management approval procedure would probably
guarantee this quality. Therefore it makes sense to
automatically suggest a translation without bothering the
user with too much choice.

Posted by Jesse Wendel on
I strongly disagree.

It is important in looking at any content of significance, to know if it is original source material, or a translation.

If it is a translation, the reader should:

a) know what language it is translated from (which may allow one to make sense of grammatical and cultural mistakes in the translation) and,

b) have access to the original source material if at all possible, so one may check the translation for oneself.

Obviously, with OpenACS, both of these should be possible.


Posted by Joel Aufrecht on
So if the original of an item is in German, and somebody translates it to French, and then somebody goes French-English, should the English page have a link "translated from French(link); original document is German(link)"?
Posted by Malte Sussdorff on
I do think so. Though this might result in a pretty long translation trail...
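
For illustration, the trail Joel describes could be derived by walking "translated from" links back to the original. This is a Python sketch under assumed data: the `TRANSLATED_FROM` mapping, the item ids, and the wording produced by `describe` are all hypothetical.

```python
def translation_trail(item_id, translated_from):
    """Follow translated-from links back to the original; nearest source first."""
    trail = []
    seen = {item_id}
    current = item_id
    while current in translated_from:
        current = translated_from[current]
        if current in seen:
            break  # guard against a cycle in bad data
        seen.add(current)
        trail.append(current)
    return trail


LANGUAGES = {1: "German", 2: "French", 3: "English"}
TRANSLATED_FROM = {2: 1, 3: 2}  # French from German, English from French


def describe(item_id):
    trail = translation_trail(item_id, TRANSLATED_FROM)
    if not trail:
        return "original document"
    text = "translated from " + LANGUAGES[trail[0]]
    if len(trail) > 1:
        text += "; original document is " + LANGUAGES[trail[-1]]
    return text


print(describe(3))
```
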
Posted by Guan Yang on
Joel: Why don't we just say that this is application specific?
Posted by Dave Bauer on
Ok, so it appears that exposing the tree of translations to the user is application specific. So it looks like it is generally useful to store this information.

Any bright ideas on what the relationships should be called?