Forum OpenACS Q&A: RFC: How to internationalize content

The current i18n system works well for user interface elements, but is not suited for creation and editing of large amounts of internationalized content. This is an architectural proposal for how to store and use internationalized content in OpenACS 5 or later.
  1. All internationalizable content is stored in the content repository
  2. Localized versions of content are stored as cr_items, separate from the original cr_item
  3. cr_items which are localized versions of other cr_items have a cr_child_rel relationship to the original item, with the relationship type "localized".
  4. There can be only one localized child per item per locale.
  5. If the original item is not in a published state, none of its localized versions are published.
  6. If a localized child is not in published state, the original item is not available for that locale.
  7. Localized cr_items have their own internal versions just like all cr_items. To have two different versions of an original item that have independently maintained and published localizations, you must use two different items.
  8. There is a mechanism to determine if the original item has been changed since the most recent publication of a localized version. This can be used to notify authors or to automatically de-publish obsoleted translations.
  9. Whenever a function would normally return the title or content of a cr_item, the internationalized form of such a function should instead return, in descending order of preference:
    1. The title and/or content from the live version of the "localized" cr_child of the original item in the locale specified to the function, if available.
    2. The title/content from the live version of the "localized" cr_child of the original item in the default locale for the language of the specified locale, if available.
    3. The title/content from the live version of the original cr_item.
    4. Nothing
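
The fallback order in rule 9 can be sketched roughly as follows. This is only an illustration in Python, not OpenACS code: the `Item` class, the `DEFAULT_LOCALE_FOR_LANGUAGE` table, and `best_title` are hypothetical stand-ins for cr_items, the per-language default locale, and the internationalized accessor. The check at the top reflects rule 5.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Item:
    """Hypothetical in-memory stand-in for a cr_item."""
    locale: str
    live_title: Optional[str] = None  # None means there is no live revision
    # locale -> localized child; a dict enforces one child per locale (rule 4)
    localized: Dict[str, "Item"] = field(default_factory=dict)


# Assumed mapping from a language code to that language's default locale.
DEFAULT_LOCALE_FOR_LANGUAGE = {"en": "en_US", "fr": "fr_FR", "de": "de_DE"}


def best_title(original: Item, locale: str) -> Optional[str]:
    # Rule 5: an unpublished original hides all of its localized versions.
    if original.live_title is None:
        return None  # rule 9.4: nothing
    language = locale.split("_")[0]
    # Rule 9.1, then rule 9.2.
    for candidate in (locale, DEFAULT_LOCALE_FOR_LANGUAGE.get(language)):
        child = original.localized.get(candidate) if candidate else None
        if child is not None and child.live_title is not None:
            return child.live_title
    # Rule 9.3: fall back to the live version of the original item.
    return original.live_title
```

With an original that has a published fr_FR child and an unpublished de_DE child, a request for fr_CA would get the fr_FR title, and a request for de_DE would fall through to the original.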

Additional issues:
"jcdldn: For the multiple items using cr_child_rels you would still need to modify the publish stuff to accommodate publishing things per language variant (so for example if you had an article with a photo and caption that needed a translated caption, so 3 child items, the article which needs localization, the photo which might or might not, and the caption which would need it) and only advertise a language variant as available if everything that needed to be translated was present."
"jcdldn: And of course for dialect variants and some less formal publishing circumstances you might want to relax that and publish if the article was translated but the caption was not, for example."
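
One possible reading of that publishing rule, as a small Python sketch. The `Part` class, the `strict` flag, and `variant_available` are made-up names for illustration and are not part of the content repository.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Part:
    """Hypothetical stand-in for an item and the locales it is translated into."""
    name: str
    needs_translation: bool
    translated_locales: Set[str] = field(default_factory=set)


def variant_available(main: Part, children: List[Part], locale: str,
                      strict: bool = True) -> bool:
    # The main item (e.g. the article) must always exist in the locale.
    if locale not in main.translated_locales:
        return False
    # Relaxed mode: publish even if a caption or similar child is untranslated.
    if not strict:
        return True
    # Strict mode: every child that needs translation must be present.
    return all(locale in child.translated_locales
               for child in children if child.needs_translation)
```

In the article/photo/caption example, strict mode would hold back the French variant until the caption is translated, while relaxed mode would advertise it as soon as the article itself is.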

Posted by Dave Bauer on
Joel,

I agree with Jeff about the rules on publish status. I think overall that might need to be addressed. Right now most applications (all?) do not use publish status. So I think we should discuss how it should work, and at what level the rules should be applied.

We should think about how creating and publishing items, and translations of those items, will work as part of the content repository Tcl API.

Posted by Dave Bauer on
Here is a related thread where similar ideas were discussed: https://openacs.org/forums/message-view?message_id=169751
Posted by Dave Bauer on
One thing we want to think about is that locale is in cr_revisions, and it seems that we do not want locale to change per revision. So, locale should be made an attribute of cr_item.
Posted by Jun Yamog on
I agree with all of the items; actually, we will be doing something similar if not identical. The only difference I can see is that we use a different relation label: "translated" rather than "localized". Maybe we can just use one and develop cooperatively.
Posted by Joel Aufrecht on
Yes, we should definitely do this only once.  Which is why I'd like to TIP a generally acceptable set of rules, so that the implementors have some discretion but still produce something we all want to use.  What all should go into the TIP?

1) The stuff in my first post
2) DaveB's suggestion to move locale from cr_revisions to cr_items (what are the consequences? Do we simply break any code that might be using it?)
3) Change the whole CR to use the same locales table as acs-lang and the rest of the system.  (Currently, CR uses a different table, with a different key size.)

I'd like to leave out API and have this just be a data model and recommended behavior TIP.  I think the dual-locale thing is simply broken and should be merged.  2 and 3 can both be handled with upgrade scripts, but both do change the cr in a way that will break anything relying on it directly.  Can we do anything about that?

Posted by Dave Bauer on
Joel,

I am not aware of any existing code that uses the cr_revisions.locale information.

One possibility to support custom applications we are not aware of would be to add cr_items.locale and wait one release cycle to remove cr_revisions.locale.

Jun,

I think the "localized" tag makes more sense, and it won't affect our plans.

One use case that is not addressed is determining which localization of an item was used to translate from. An example would be an item that was originally written in English, then translated to French, and subsequently translated from French to German. I think this is application specific and can be accommodated by recording this additional relationship separately. I think it is necessary to tie all localized items to one "master" content_item. This makes it efficient to show a list of all localized versions of an item.

Posted by Guan Yang on
Here's a suggestion for the meaning of 'translated' and 'localized':

Item A: Original English item
Item B: French version translated from item A
Item C: German version translated from item B

Both items B and C have parent_id pointing to item A with the tag 'localized'.

Additionally, C has a rel with item B with the tag 'translated'.

This means that 'translated' means 'based on this item', while 'localized' means 'is this item in language X'.
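
To make the two tags concrete, here is a minimal Python sketch of the A/B/C example. The `RELS` list and both helper functions are illustrative assumptions, not the actual cr_child_rels or cr_item_rels schema.

```python
from typing import List, Optional

# (item, related_item, tag), modelled loosely on the example above.
RELS = [
    ("B", "A", "localized"),   # B is item A in French
    ("C", "A", "localized"),   # C is item A in German
    ("C", "B", "translated"),  # C was produced by translating B
]


def localizations_of(master: str) -> List[str]:
    """All items that are 'this item in language X' for the given master."""
    return sorted(item for item, target, tag in RELS
                  if target == master and tag == "localized")


def translated_from(item: str) -> Optional[str]:
    """The item this one was based on, if an explicit 'translated' rel exists."""
    for source, target, tag in RELS:
        if source == item and tag == "translated":
            return target
    return None
```

Listing every language version of A is then a single lookup on the 'localized' tag, while the translation chain is followed through 'translated'.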

Posted by Dave Bauer on
Guan,

Thank you, that is exactly what I meant. You said it much more clearly than I did.

Posted by Joel Aufrecht on
So let's put in both "localized" and "translated". If a given item has neither a localized nor a translated tag, it is not derived from any other content. If it has only localized, it is both localized and translated from that item. If item X has translated of Y and localized of Z, we could enforce the rule that Y must be in a translation chain leading to Z, or we could just not bother. Should it be allowable to have a "translated" but not a "localized"?
Posted by Jun Yamog on
Hi,

In our case, "translated" will be implemented by pointing parent_id at the source item:

Item C.parent_id -> Item B
Item B.parent_id -> Item A
Item A.parent_id -> usual folder id in CR.

It used to be designed using relations as stated above. I changed it to use parent_id since it has built-in tree functions. What do you guys think? I guess the drawback is that parent_id does not really say "translated", although in our case we use it as basically "cloned from" or "copied from". We use cr_child_rel to denote that it's a translation.

Posted by Joel Aufrecht on
Maybe we should do both.  I definitely think we should do the child_rel stuff in some form.  I hope that you put it into your current implementation, because if you don't retain that information in your data model, it will be much harder to upgrade to a standard that has explicit translate/localize settings.
But if we have the child rels but don't use parent id, then any query that gets back a bunch of cr_items will have to filter out based on child_rel.  So this argues, if I understand the cr model correctly, for using your parent_id approach for every content item that is a translation or localization of another item.
Posted by Joel Aufrecht on
There actually already is a cr_items.locale. The only problem is that it's a foreign key to cr_locales.locale, which is varchar(4), whereas ad_locale.locale is varchar(30) and in practice contains only five-character strings. The upshot is that the upgrade should be really easy. We just have to change the FK to use ad_locale, and figure out how to handle any data currently using cr_locales. How about just doing a one-time mapping from the initial values of cr_locales.locale to those of ad_locale.locale, and throwing an error back to the upgrader for locales not in the mapping? I'd like to get this into 5.2.
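
The one-time mapping could look something like the Python sketch below. The contents of `CR_TO_AD_LOCALE` are an assumption (the thread does not list the initial cr_locales values), and a real upgrade would be an SQL script rather than Python.

```python
from typing import Dict, Iterable

# Assumed mapping from old cr_locales keys to five-character ad_locale keys.
CR_TO_AD_LOCALE = {"us": "en_US"}


def map_locales(cr_locales_in_use: Iterable[str]) -> Dict[str, str]:
    """Map every cr_locales value in use, or fail loudly for the upgrader."""
    in_use = set(cr_locales_in_use)
    unmapped = sorted(in_use - set(CR_TO_AD_LOCALE))
    if unmapped:
        # Refuse to guess: the upgrader has to extend the mapping by hand.
        raise ValueError(
            "no ad_locale mapping for cr_locales values: " + ", ".join(unmapped)
        )
    return {locale: CR_TO_AD_LOCALE[locale] for locale in in_use}
```

Failing before any row is touched means a site with custom cr_locales entries finds out at upgrade time instead of ending up with dangling foreign keys.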
Posted by Dave Bauer on
Jun,

Another way to handle the translation tree is to put the sortkeys in their own table and use the new faster sortkey design that is in acs_objects.

Using parent_id is not generally useful because it represents the object/URL hierarchy in most applications.

Posted by Dave Bauer on
Joel,

Your upgrade strategy sounds good.

Posted by Jun Yamog on
Hi Dave,

I guess we need to create a new table to keep track of how the items were copied/cloned.

Posted by Lars Pind on
This overall looks sound. A few minor concerns ...

Why do you want to do this as a cr_child_rel instead of a cr_item_rel? The child rel stuff to me at least signifies that the child is "part of" the parent, a containment relationship. The canonical example is a photo for a news article.

cr_item_rel has exactly the same columns, with just two of them named differently (item_id/related_item_id instead of parent_id/child_id). item_rel signals that they're peers, instead of one being subordinated to the other.

The other concern is with the terminology of localized/translated: Is that distinction clear enough from the terms alone? Will people get confused over which of the two means "it's the same item, just in a different language", and which means "this item was created by translating this other item"? Can the native English speakers out there come up with another pair of terms which more accurately clarifies the distinction, so developers won't have to look in the documentation to figure it out every time?

Otherwise it's cool.

/Lars

Posted by Malte Sussdorff on
The upgrade should be plain and easy as we can use the nls_charset / nls_territory used in both tables to find out the correct ad_locale function. Not to mention that we only have US in cr_locales anyway in a new installation ....

I will do this upgrade on our system and write a TIP for it.

Posted by Malte Sussdorff on
I am wondering about one thing though: why do we bother with separate items if we are saying that we need to unpublish / delete translated items once we delete the originating item?

Reason I am asking: If we are linking the translated items in a fixed manner to the original item and have to maintain all this, why not go the easy way and say:

There exists one item. This item has a latest revision and a published revision in its default locale (cr_items.locale).

Additionally, it can have other revisions that have other locales (cr_revisions.locale). A special mapping table (cr_item_locale_map) will map the item_id, locale, published_revision, latest_revision.

This will not break any existing code, as latest and live revision in cr_items are still treated the same way. We would just add code to content::item::get_best_revision to look first in the mapping table and if it can't find anything according to the rules defined in point 9 of Joel's initial draft, then return the revision_id as before. This would also make it easier for existing applications to make use of internationalization of content, as a lot are using get_best_revision already.
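
A rough Python sketch of that lookup, assuming cr_item_locale_map is keyed on (item_id, locale). The dictionaries and revision ids are invented sample data, and the function only mimics the idea; it does not reproduce the signature of the real content::item::get_best_revision. For brevity it checks only the exact locale, not the full fallback chain from point 9.

```python
from typing import Dict, Optional, Tuple

# Stand-in for cr_item_locale_map:
# (item_id, locale) -> (published_revision, latest_revision)
CR_ITEM_LOCALE_MAP: Dict[Tuple[int, str], Tuple[Optional[int], Optional[int]]] = {
    (100, "de_DE"): (205, 207),   # German has a published revision
    (100, "fr_FR"): (None, 210),  # French exists but is not published yet
}

# Stand-in for cr_items.live_revision (the item's default locale).
CR_ITEMS_LIVE: Dict[int, int] = {100: 201}


def get_best_revision(item_id: int, locale: str) -> Optional[int]:
    # Look in the mapping table first.
    published, _latest = CR_ITEM_LOCALE_MAP.get((item_id, locale), (None, None))
    if published is not None:
        return published
    # Otherwise behave exactly as before: return the item's live revision.
    return CR_ITEMS_LIVE.get(item_id)
```

Callers that never pass a mapped locale keep getting the same revision they get today, which is the backwards-compatibility argument made above.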

I would probably go and implement this for ETP unless there is an outcry saying we definitely need content_items for each language. And if the more complex solution is needed, I am absolutely in favour of someone implementing it, but my guess is, due to the fact that it was not implemented yet, it might as well never be. Not to mention that both ideas can coexist beautifully.

Posted by Dave Bauer on
Malte,

I guess we would need to discuss the formal requirements for this. As far as I know, even for GCMS, which this plan was originated for and the CTK codebase which also was supposed to use it, it was never implemented. That is, the two projects that needed to support localized content did not use this proposal.

So it appears you are right, it's too complex. Maybe we can discuss how xowiki supports pages in multiple languages now, and build on that, since it's the only example of working code we have to refer to.

Posted by Dave Bauer on
I'd suggest looking at LinguaPlone, which is probably the closest thing to OpenACS that has a similar solution for localized content: http://plone.org/products/linguaplone/

I think defining what actions the user would need to perform on content to localize it, will help define how we should implement it.

Posted by Dave Bauer on
It appears LinguaPlone uses separate objects for each translation with a link between them. That way, if you destroy the link the content is still there.

I think this supports the idea of one cr_item per translation. That way you can publish it separately from the original, when the translation is complete.

It also makes the code MUCH simpler. There is already support for almost all the features you need to support translation. The CR Tcl API supports creating and managing items, and doesn't really support publishing more than one revision from a single object. It also supports adding the links between items.

To simplify the proposal we could make each item stand alone, and link them with cr_item_rels instead of cr_child_rels. Then you could just query for links to see if an item in another language existed.

Posted by Dave Bauer on
Anyone still interested in i18n of content?

I know xowiki has a technique of adding the language code to the name/URL of the item, which is cool for a wiki but does seem to make it hard to query for the same content in another language.

I am just posting because someone mentioned content in multiple languages earlier (I could not find the thread, but I see I posted about LinguaPlone before). I just wanted to remind folks to look at LinguaPlone for inspiration. It appears to be a good system for content in multiple languages.

Posted by Malte Sussdorff on
We are still interested, but I think the XoWIKI method of doing this is actually quite fine. Name the item like "en:project_1234" and "de:project_12123" and rewrite your application to take that naming convention into consideration. We might actually change content::item::new to take this into account automatically, so name the items with the locale of the user or the default locale and use the same method XoWIKI is using to figure this out.
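
A small Python sketch of how an application might work with such names. It assumes a two-letter language prefix and, unlike the two example names above, assumes that all language variants share the same base name. `split_name` and `pick_name` are hypothetical helpers, not XoWIKI functions.

```python
from typing import Iterable, Optional, Set, Tuple


def split_name(name: str, default_language: str = "en") -> Tuple[str, str]:
    """Split 'de:project_1234' into ('de', 'project_1234')."""
    prefix, sep, rest = name.partition(":")
    if sep and len(prefix) == 2 and prefix.isalpha():
        return prefix.lower(), rest
    # No recognizable prefix: treat the whole name as default-language content.
    return default_language, name


def pick_name(base: str, preferred_languages: Iterable[str],
              existing_names: Set[str]) -> Optional[str]:
    """Return the first existing '<lang>:<base>' name in order of preference."""
    for language in preferred_languages:
        candidate = f"{language}:{base}"
        if candidate in existing_names:
            return candidate
    return None
```

Passing the preference list explicitly also covers a user who wants the English version even though a German one exists: they simply put "en" first.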

But probably Gustaf could enlighten us more if it would be something generally useful for content items and then we could discuss this in a TIP.

Sadly, one problem I see is that nearly all packages work by item_id and not by item name. Therefore, on each call to get the contents of a revision or an item, we would have to check whether the item exists in the user's language. And how would we then deal with the scenario that the user explicitly wants to have the English version and not the German one?

I guess for the time being I'll just rewrite the applications in XoWIKI if I need this to work (e.g. lars blogger).