ACS 4 Globalization Detailed Design

by Henry Minsky

I. Essentials

II. Introduction

III. Historical Considerations

V. Design Tradeoffs

VI. API

VI.A Locale API

10.30 A Locale object represents a specific geographical, political, or cultural region. An operation that requires a Locale to perform its task is called locale-sensitive and uses the Locale to tailor information for the user. For example, displaying a number is a locale-sensitive operation--the number should be formatted according to the customs/conventions of the user's native country, region, or culture.

We will refer to a Locale by a combination of a language and a country. The Java Locale API also allows an optional variant to be added to a locale; we will omit it in the Tcl API.

The language is a valid ISO Language Code. These codes are the lowercase two-letter codes as defined by ISO-639. You can find a full list of these codes at a number of sites, such as:
http://www.ics.uci.edu/pub/ietf/http/related/iso639.txt

The country is a valid ISO Country Code. These codes are the uppercase two-letter codes as defined by ISO-3166. You can find a full list of these codes at a number of sites, such as:
http://www.chemie.fu-berlin.de/diverse/doc/ISO_3166.html

Examples are:

en_US    English (United States)
ja_JP    Japanese (Japan)
fr_FR    French (France)
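
As a minimal sketch of the language_COUNTRY convention, the following Tcl splits a locale string into its two parts. The proc name locale_split is hypothetical and is not part of the ad_locale API. It also accepts a bare language code, and it assumes there is no variant component.

```tcl
# Hypothetical helper (not part of the ad_locale API): split a locale
# string such as "en_US" into a two-element list {language country}.
proc locale_split {locale} {
    # Language: two lowercase letters (ISO-639).
    # Country: optional, two uppercase letters (ISO-3166).
    if {![regexp {^([a-z]{2})(?:_([A-Z]{2}))?$} $locale -> language country]} {
        error "malformed locale \"$locale\""
    }
    return [list $language $country]
}

puts [locale_split en_US]   ;# prints: en US
puts [locale_split fr]      ;# prints: fr {}
```

A bare language code yields an empty country, which a caller could then resolve through the default locale configured for that language.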

The i18n module figures out the locale for the current request and makes it accessible via the ad_locale function:

[ad_locale user locale] => fr_FR
[ad_locale subsite locale] => en_US

It has not yet been decided how the user's preferred locale will be initialized. For now, there is a site-wide default package parameter, [parameter::get -parameter DefaultLocale -default "en_US"], and an API for setting the locale, with the preference stored in a session variable. The ad_locale_set function sets the user's preferred locale to a desired value and saves the value in the current session.

# Also sets [ad_locale user language] automatically (to "en" in this case)
ad_locale_set locale "en_US"

ad_locale_set timezone "PST"

    
The request processor should use the ad_locale API to figure out the preferred locale for a request (perhaps combining user preference with subsite defaults in some way). It will make this information accessible via the ad_conn function:
ad_conn locale

Character Sets and Encodings

We refer to character sets by their MIME character set names, which are the values that can be passed in a MIME header, such as
Content-Type: text/html; charset=iso-8859-1

You can obtain the preferred character set for a locale via the ad_locale API, as shown below:

set locale "en_US"
[ad_locale charset $locale] => "iso-8859-1"

This returns the case-insensitive name of a MIME character set, such as "iso-8859-1" or "shift_jis".

We already have an AOLserver function to convert a MIME charset name to a Tcl encoding name:

[ns_encodingforcharset "iso-8859-1"] => iso8859-1

Templating

The goal of templates is to separate program logic from data presentation.

For presenting data in multiple languages, there are two basic ways to use templates for a given abstract URL. Say we have the URL "foo", for example. We can provide templates for it in the following ways:

  • A separate template file for each language or locale, such as foo.en.adp and foo.fr.adp.
  • A single template file, foo.adp, in which all language-dependent strings are looked up in a message catalog.

Both styles of authoring templates will probably be used. For pages which contain a lot of free-form text content, having a separate template page for each language would be easiest.

But for a page which has a very fixed format, such as a data entry form, it would mean a lot less redundant work to use a single template source page to handle all the languages, and to have all language-dependent strings be looked up in a message catalog. We can do this either by creating data sources which call lang_message_lookup, or by using the <TRN> tag to do the same thing from within an ADP file.

Caching multilingual ADP Templates

Message catalog lookups can be expensive if many of them are done in a page. The templating system can already precompile and cache ADP pages. This works fine for a page in a specific language, such as foo.en.adp, but we need to modify the caching mechanism if we want to use a single template file to target multiple languages.

Computing the Effective Locale

Let's say you have a template file "foo.adp" and it contains calls to look up message strings using the TRN tag:

<master>
<trn key="username_prompt">Please enter your username</trn>
<input type="text" name="username">
<p>
<trn key="password_prompt">Enter Password:</trn>
<input type="password" name="passwd">

If the user requests the page foo, and their ad_locale is "en_US", then the effective locale is "en_US". Message lookups are done using the effective locale. If the user's locale is "fr_FR", then the effective locale will be "fr_FR".

If we evaluate the TRN tags at compile time then we need to associate the effective locale in which the page was evaluated with the cached compiled page code.

The effective locale of a template page that has an explicit locale, such as a file named "foo.en.adp" or "foo.en_US.adp", will be that explicit locale. So for example, even if a user has a preferred locale of "fr_FR", if there is only a page named "foo.en.adp", then that page will be evaluated (and cached) with an effective locale of en_US.
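
One way to picture the caching requirement is to key the compiled-template cache on both the file name and the effective locale. The sketch below is only an illustration: the default_locale array is an assumed stand-in for the system locale tables, and the proc names effective_locale and template_cache_key are hypothetical rather than part of the templating system.

```tcl
# Assumed stand-in for the system locale tables: default locale per language.
array set default_locale {en en_US fr fr_FR ja ja_JP}

# Return the effective locale for a template file and a user's locale.
proc effective_locale {template_file user_locale} {
    global default_locale
    # An explicit suffix such as .en_US.adp or .en.adp wins over the user's locale.
    if {[regexp {\.([a-z]{2})(_[A-Z]{2})?\.adp$} $template_file -> language country]} {
        if {$country ne ""} {
            return "$language$country"
        }
        # Language-only suffix: fall back to that language's default locale.
        if {[info exists default_locale($language)]} {
            return $default_locale($language)
        }
        return $language
    }
    # A generic template such as foo.adp is evaluated in the user's locale.
    return $user_locale
}

# Build a cache key that keeps one compiled copy per effective locale.
proc template_cache_key {template_file user_locale} {
    return "$template_file,[effective_locale $template_file $user_locale]"
}

puts [template_cache_key foo.adp fr_FR]      ;# prints: foo.adp,fr_FR
puts [template_cache_key foo.en.adp fr_FR]   ;# prints: foo.en.adp,en_US
```

With a key of this shape, foo.adp compiled for a French user and foo.adp compiled for an English user occupy separate cache entries, while foo.en.adp always maps to a single entry.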

VI.B Naming of Template Files To Encode Language and Character Set

10.40 The templating system will use the Locale API to obtain the preferred locale for a page request, and will attempt to find a template file which most closely matches that locale.

We will use the following convention for naming template files: filename.locale_or_language.adp.

Examples:

foo.en_US.adp
foo.en.adp

foo.fr_FR.adp
foo.fr.adp

foo.ja_JP.adp
foo.ja.adp

The user request has a locale which is of the form language_country. If someone wants English, they will implicitly be choosing a default, such as en_US or en_GB. The default locale for a language can be configured in the system locale tables. So for example the default locale for "en" could be "en_US".

The algorithm for finding the best matching template for a request in a given locale is given below:

  1. Find the desired target locale using [ad_conn locale]. NOTE: This will always be a specific locale (i.e., language_COUNTRY).
  2. Look for a template file whose locale suffix matches exactly.

    For example, if the filename in the URL request is simply foo and [ad_conn locale] returns en_US then look for a file named foo.en_US.adp.

  3. If an exact match is not found, look for template files whose name matches the language portion of the target locale.

    For example, if the URL request name is foo and [ad_conn locale] returns en_US and a file named foo.en_US.adp is not found, then look for all templates matching "en_*" as well as any template which just has the "en" suffix.

    So for example if the user's locale is en_GB and only the following file exists:

    foo.en_US.adp

    then use foo.en_US.adp

    If however both foo.en_US.adp and foo.en.adp exist, then use foo.en.adp preferentially, i.e., don't switch locales if you can avoid it. The reasoning here is that people can be very touchy about switching locales, so if there is a generic matching language template available for a language, use it rather than using an incorrect locale-specific template.

  4. If no locale-specific template is found, look for a template matching just the language.

    I.e., if the request is for en_US, and there exists a file foo.en.adp, use that.

  5. If no locale-specific or language-specific template is found, look for a simple .adp file, such as foo.adp.
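
The search order above can be sketched in plain Tcl as follows. This is an illustration under stated assumptions, not the templating system's code: the proc name find_template is hypothetical, the list of available files stands in for a directory listing, and picking the alphabetically first file when several other locales of the same language exist is an assumption, since the algorithm does not say how to break that tie.

```tcl
# Hypothetical sketch of the template search order.
#   name      - the request name, e.g. foo
#   locale    - the target locale from [ad_conn locale], e.g. en_GB
#   available - list of template file names that exist
proc find_template {name locale available} {
    set language [lindex [split $locale _] 0]

    # Step 2: exact locale match, e.g. foo.en_GB.adp
    set exact "$name.$locale.adp"
    if {[lsearch -exact $available $exact] >= 0} {
        return $exact
    }

    # Steps 3 and 4: prefer the generic language template, e.g. foo.en.adp ...
    set generic "$name.$language.adp"
    if {[lsearch -exact $available $generic] >= 0} {
        return $generic
    }

    # ... otherwise fall back to another locale of the same language.
    # Assumption: take the alphabetically first candidate.
    set others [lsort [lsearch -all -inline -glob $available "$name.${language}_*.adp"]]
    if {[llength $others] > 0} {
        return [lindex $others 0]
    }

    # Step 5: plain template with no locale suffix.
    set plain "$name.adp"
    if {[lsearch -exact $available $plain] >= 0} {
        return $plain
    }
    return ""
}

puts [find_template foo en_GB {foo.en_US.adp foo.adp}]      ;# prints: foo.en_US.adp
puts [find_template foo en_GB {foo.en_US.adp foo.en.adp}]   ;# prints: foo.en.adp
puts [find_template foo fr_FR {foo.en.adp foo.adp}]         ;# prints: foo.adp
```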

Once a template file is found we must decide what character set it is authored in, so that we can correctly load it into Tcl (which converts it to UTF8 internally).

It would be simplest to mandate that all templates are authored in UTF8, but that is just not a practical thing to enforce at this point, I believe. Many designers and other people who actually author the HTML template files will still find it easier to use legacy tools that author in their "native" character sets, such as ShiftJIS in Japan, or BIG5 in China.

So we make the convention that the template file is authored in its effective locale's character set. For multilingual templates, we will load the template in the site default character set as specified by the AOLserver OutputCharset initialization parameter. For now, we will say that authoring generic multilingual ADP files can and should be done in ASCII. Eventually we can switch to using UTF8.

The character set corresponding to a locale can be found using the [ad_locale charset $locale] command. The templating system should call this right after it computes the effective locale, so that it can set up the charset encoding conversion before reading the template file from disk.
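
A minimal sketch of that reading step in plain Tcl follows. It assumes the MIME charset name has already been converted to a Tcl encoding name (in AOLserver, by ns_encodingforcharset as described earlier), and the proc name read_template_file is hypothetical.

```tcl
# Hypothetical helper: read a file from disk, decoding its bytes with the
# given Tcl encoding name (for example iso8859-1 or shiftjis). Tcl then
# holds the result as Unicode text internally.
proc read_template_file {path tcl_encoding} {
    set channel [open $path r]
    # Set the conversion before any data is read from the channel.
    fconfigure $channel -encoding $tcl_encoding
    set text [read $channel]
    close $channel
    return $text
}
```

For instance, a ShiftJIS-authored template could be loaded with [read_template_file foo.ja.adp shiftjis], where the file name is only an example.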

We read the template file using this encoding, and set the default output character set to it as well. Inside either the .adp page or the parent .tcl page, the developer can issue a command to override this default output character set. Currently this is done by putting an explicit content-type header in the AOLserver output headers. For example, to force the output to ISO-8859-1, you would do

ns_set put [ns_conn outputheaders] "content-type" "text/html; charset=iso-8859-1"

Design question: We should have an API for this. The hack now is that the ADP handler adp_parse_ad_conn_file looks at the output headers, and if it sees a content type with an explicit charset, it passes it along to ns_return.

The default character set for a template .adp file should be the default system encoding.

VI.C Loading Regular Tcl Script Files

10.50 By default, Tcl and template files in the system will be loaded using the default system encoding. This is generally ISO-8859-1 for AOLserver running on Unix systems in English.

This default can be overridden by setting the AOLserver init parameter for the MIME type of .tcl files to include an explicit character set. If an explicit MIME type is not found, ns_encodingfortype will default to the AOLserver init parameter value DefaultCharset if it is set.

Example AOLserver .ini configuration file to set default script file and template file charset to ShiftJIS:

ns_section ns/mimetypes
...
ns_param .tcl {text/plain; charset=shift_jis}
ns_param .adp {text/html; charset=shift_jis}

ns_section ns/parameters
...
# charset hacking
ns_param HackContentType 1
ns_param URLCharset shift_jis
ns_param OutputCharset shift_jis
ns_param HttpOpenCharset shift_jis
ns_param DefaultCharset shift_jis

VI.D Message Catalog API

We want to use something like the Java ResourceBundle, where the developer can declare a set of resources for a given namespace and locale.

For AOLserver/Tcl, to make the message catalog more manageable, we will split it into one message catalog per package, plus one default global message namespace in case we need it.

Message lookups are done using a combination of a key string and a locale or language, as well as an implicit package prefix on the key string. The API for using the message catalog is as follows:

lang_message_lookup locale key [default_string]

lang_message_lookup is abbreviated by the procedure named "_", which is the convention used by the GNU gettext message catalog package.

The locale argument can be a full locale or a simple language abbreviation, such as fr, en, etc. The lookup rules for finding strings based on key and locale are tried in order as follows:
  1. Lookup is first tried with the full locale (if present) and package.key
  2. Lookup is tried with just the language portion of the locale and package.key
  3. Lookup is tried with the full locale and key without package prefix.
  4. Lookup is tried with language and key without package prefix.
Example: You are looking up the message string "Title" in the notes package.
[lang_message_lookup $locale notes.title "Title"]

can be abbreviated by
[_ $locale notes.title "Title"]

# message key "title" is implicitly with respect to package key
#  "notes", i.e., notes.title
[_ $locale title "Title"]

The string is looked up by the symbolic key notes.title (or title for short), and the constant value "Title" is supplied as documentation and as a default value. Having a default value allows developers to code their application immediately without waiting to populate the message catalog.
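
The four-step lookup order and the default-string fallback can be sketched as follows. This is an illustration, not the implementation of lang_message_lookup: the procs message_register and message_lookup are hypothetical, the catalog is a plain in-memory array, and the package is passed explicitly here rather than taken implicitly from the current request.

```tcl
# In-memory stand-in for the message catalog, indexed by "locale,key".
array set catalog {}

# Hypothetical helper: store one message under a locale (or language) and key.
proc message_register {locale key message} {
    global catalog
    set catalog($locale,$key) $message
}

# Try the four lookups in the documented order, then the default string.
proc message_lookup {package locale key {default_string ""}} {
    global catalog
    set language [lindex [split $locale _] 0]
    # Outer loop: package-qualified key first, then the bare key.
    foreach candidate_key [list "$package.$key" $key] {
        # Inner loop: full locale first, then just the language.
        foreach candidate_locale [list $locale $language] {
            if {[info exists catalog($candidate_locale,$candidate_key)]} {
                return $catalog($candidate_locale,$candidate_key)
            }
        }
    }
    return $default_string
}

message_register fr notes.title "Titre"
puts [message_lookup notes fr_FR title "Title"]   ;# prints: Titre
puts [message_lookup notes ja_JP title "Title"]   ;# prints: Title
```

The first call finds the message at step 2 (language plus package.key); the second finds nothing and returns the default string.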

Default Package Namespace

By default, keys are prefixed with the name of the current package (if a page request is being processed). So a lookup of the key "title" in a page in the bboard package will actually reference the "bboard.title" entry in the message catalog.

You can override this behavior by either using a fully qualified key such as bboard.title or else by changing the message catalog namespace using the lang_set_package command:

[lang_set_package "bboard"]
So for example code that runs in a scheduled proc, where there is not necessarily any concept of a "current package", would either use fully qualified keys to look up messages, or else call lang_set_package before doing a message lookup.

Message Catalog Definition Files

A message catalog is defined by placing a file in the catalog subdirectory of a package. Each file defines a set of messages in different locales, and the file is written in a character set specified by its file suffix:
/packages/bboard/catalog/
                         bboard.iso-8859-1
                         bboard.shift_jis
                         bboard.iso-8859-6
A message catalog file consists of Tcl code to define messages in a given language or locale:
_mr en mail_notification "This is an email notification"
_mr fr mail_notification "Le notification du email"
...

In the example above, if the catalog file was loaded from the bboard package, all of the keys would be prefixed automatically with "bboard.".

Loading A Message Catalog At Package Init Time

The API function

lang_catalog_load package_key

is used to load the message catalogs for a package. The catalog files are stored in a package subdirectory called catalog. Their filenames have the form *.encoding.cat, where encoding is the name of a MIME charset encoding (not a Tcl charset name as was used in a previous version of this command).

/packages/bboard/catalog
                        /main.iso-8859-1.cat
                        /main.shift_jis.cat
                        /main.iso-8859-6.cat
                        /other.iso-8859-1.cat
                        /other.shift_jis.cat
                        /other.iso-8859-6.cat

You can add more pseudo-levels of hierarchy in naming the message keys, using any separator character you want, for example

_mr fr alerts.mail_notification "Le notification du email"
which will be stored with the full key of bboard.alerts.mail_notification .
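
The automatic package prefix can be pictured with the sketch below. It is a simplified stand-in, not the real catalog loader: the _mr defined here only mimics the prefixing behavior, load_catalog_script is a hypothetical name, and the catalog definitions are passed as a script string instead of being read from a .cat file.

```tcl
# In-memory stand-in for the message catalog, indexed by "locale,key".
array set catalog {}
set current_package ""

# Stand-in for _mr: store a message under "locale,package.key".
proc _mr {locale key message} {
    global catalog current_package
    set catalog($locale,$current_package.$key) $message
}

# Hypothetical loader: evaluate catalog definitions on behalf of a package,
# so that every key they register receives that package's prefix.
proc load_catalog_script {package_key script} {
    global current_package
    set current_package $package_key
    uplevel #0 $script
    set current_package ""
}

load_catalog_script bboard {
    _mr en mail_notification "This is an email notification"
    _mr fr alerts.mail_notification "Le notification du email"
}

puts [lsort [array names catalog]]
# prints: en,bboard.mail_notification fr,bboard.alerts.mail_notification
```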

Calling the Message Catalog API from inside of Templates

Inside of a template, you can always make a call to the message catalog API via a Tcl escape:
<%= [_ $locale bboard.passwordPrompt "Enter Password"]%> 
However, this is awkward and ugly to use. We have defined an ADP tag which invokes the message catalog lookup. As explained in the previous section, since our system precompiles adp templates, we can get a performance improvement if we can cache the message lookups at template compile time.

The <TRN> tag is a call to lang_message_lookup that can be used inside of an ADP file. Here is the documentation:

Procedure that gets called when the <trn> tag is encountered on an ADP page. The purpose of the procedure is to register the text string enclosed within a pair of <trn> tags as a message in the catalog, and to display the appropriate translated string. Takes four optional parameters: key, lang, type and static.
  • key specifies the key in the message catalog. If it is omitted, this procedure simply returns the text enclosed by the tags.
  • lang specifies the language of the text string enclosed within the tags. If it is omitted, the value defaults to English.
  • type specifies the context in which the translation is made. If omitted, type is user, which means that the translation is provided in the user's preferred language.
  • static specifies that this tag should be translated once at template compile time, rather than dynamically every time the page is run. This will be unnecessary and will be deprecated once we have implemented effective-locale-based caching for templates.
Example 1: Display the text string Hello on an ADP page (i.e. do nothing special):
    <trn>Hello</trn>
    
Example 2: Assign the key hello to the text string Hello and display the translated string in the user's preferred language:
    <trn key="hello">Hello</trn>
    
Example 3: Specify that Bonjour needs to be registered as the French translation for the key hello (in addition to displaying the translation in the user's preferred language):
    <trn key="hello" lang="fr">Bonjour</trn>
    
Example 4: Register the string and display it in the preferred language of the current user. Note that the possible values for the type parameter are determined by what has been implemented in the ad_locale procedure. By default, only the user type is implemented. An example of a type that could be implemented is subsite, for displaying strings in the language of the subsite that owns the current web page.
    <trn key="hello" type="user">Hello</trn>
    

Example 5: Translate the string once at template compile time, using the effective locale of the page.

    <trn key="hello" static>Hello</trn>
    

VII. Data Model Discussion

Internationalizing the Data Models

Some data which is stored in ACS package and core database tables may be presented to users, and thus may need to be stored in multiple languages. Examples of this are the descriptions of package or site parameters in the administrative interface, the "pretty names" of objects, and group names.

Tables which are in acs kernel and have user-visible names that may need to be translated in order to create an admin back end in another language:

user groups:
   group_name

acs_object_types:
   pretty_name
   pretty_plural

acs_attributes:
   pretty_name
   pretty_plural

acs_attribute_descriptions
   description (clob)

procedure add_description- add a lang arg ?

acs_enum_values ? pretty_name

acs_privileges: 
  pretty_name
  pretty_plural

apm_package_types
  pretty_name
  pretty_plural


apm_package "instance_name"? Maybe a given instance
gets instantiated with a name in the desired language?


apm_parameters: 
   parameter_name
   section_name

One approach is to split a table into two tables, one holding language-independent data, and the other holding language-dependent data. This approach was described in the ASJ Multilingual Site Article.

In that case, it is convenient to create a new view which looks like the original table, with the addition of a language column that you can specify in the queries.

Drawbacks to Splitting Tables

It is not totally transparent to developers
Every query against the table which requests or modifies language-dependent columns must now include a WHERE clause to select the language.

Extra join may slow things down
The extra join of the two tables may cause queries to slow down, although I am not sure what the actual performance hit might be. It shouldn't be too large, because the join is against a fully indexed table.

VIII. User Interface

IX. Configuration/Parameters

X. Code Examples

XI. Future Improvements/Areas of Likely Change

XII. Authors

XIII. Revision History

Document Revision # | Action Taken, Notes | When? | By Whom?
0.1 | Creation | 12/4/2000 | Henry Minsky
0.2 | More specific template search algorithm, extended message catalog API to use package keys or other namespace | 12/4/2000 | Henry Minsky
0.3 | Details on how the <TRN> tag works in templates | 12/4/2000 | Henry Minsky
0.4 | Definition of effective locale for template caching, documentation of TRN tag | 12/12/2000 | Henry Minsky

hqm@arsdigita.com