Forum OpenACS Development: Implementing locale-aware character encodings

Here is a summary of what I know on the topic for Lars or anyone else enhancing acs-lang. I'll address the infrastructure already in place in acs-lang, further work that's been done for Greenpeace Planet, and what I think needs to be addressed for a more general solution to the problem. A general solution is of course more ambitious than one that works in Planet.

AFAIK, any system using only one character set should have no character encoding problems with OpenACS in its current state: it's just a matter of setting up your AOLserver config file to send the right charset along with the mimetype. (See the links below to other resources for details on how to do this.) For example, if you are only dealing with Western European languages on your site, this can be handled by setting up your config file accordingly. If you need to switch character encodings within one AOLserver instance, your code has to do the switching.

A lot of the hassle of implementing locale-specific character encodings would be unnecessary if unicode (utf-8) could safely be sent to browsers (you might still run into problems reading non-unicode files on disk, though). If you set your preferences to Japanese in Google, you'll get utf-8 encoded pages sent back to you. Yahoo Japan uses euc-jp. Different organizations are obviously making different calls on whether unicode is ready for prime time, possibly weighing that against the technical challenges of not using it. Google, for example, has a multilingual site, and that might explain their decision to be Unicode early adopters.

Background

Relevant documents:

AOLserver charset api

ArsDigita created an AOLserver API for handling charsets that is included in AOLserver 3.3+ad13. Have a look at the charsets tcl module. I think this code is undocumented outside the documents linked above.

The calls include ns_urlcharset, ns_charset, ns_cookiecharset, ns_formfieldcharset, and ns_encodingforcharset.

Tcl encodings and internet charsets

This part's especially arcane. :) The AOLserver tcl api is conversant with the charsets commonly used on the internet. You can get a list of all the ones AOLserver knows about by calling [ns_charsets]. Tcl also has the ability to convert between encodings / charsets, and understands a whole bunch of different charsets. Type

encoding names

...from tclsh and you'll get another nice list of names... only many of the names are slightly different. Cool huh? You'll need to use one or the other depending on whether you're passing a charset to an AOLserver tcl api call...

ns_urlcharset "shift_jis"
... or a core tcl command:
fconfigure $fd -encoding "shiftjis"

Once you've marveled at this a little, you can relax. ns_encodingforcharset will convert an internet charset (which is what acs-lang will supply for a locale) to a tcl encoding.
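
For example, a quick sanity check (shift_jis is just the running example from above):

set charset "shift_jis"
set enc [ns_encodingforcharset $charset]
# $enc now holds the tcl encoding name "shiftjis", ready to pass to fconfigure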

How AOLserver encodes output to browsers

When you call:

ns_return 200 "text/html; charset=shift-js" "bla bla"

AOLserver returns http header:

Content-Type: text/html; charset=shift_jis

...AND encodes the data in shift_jis.

When you call:

ns_return 200 "text/html" "bla bla"

...AOLserver will include the header:

Content-Type: text/html; charset=foo

...AND encodes the data in the foo charset IF the AOLserver config file has ns/parameters OutputCharset=foo and ns/parameters HackContentType=1. (The defaults are iso-8859-1 and 1, respectively.)
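
In the config file, that looks something like this (a sketch; the values shown are just the defaults mentioned above):

ns_section "ns/parameters"
ns_param OutputCharset "iso-8859-1"  ;# charset appended to bare content-types
ns_param HackContentType 1           ;# actually append it (and encode accordingly)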

When do character encodings come into play?

Unicode is used internally by the tcl interpreter, and Oracle and PostgreSQL store strings in unicode (I'm not sure whether this needs special configuration in either case). So unicode's at the center of things, but you typically input from and output to different character encodings.

  • When form data is read, it needs to be decoded according to the charset used in the page that sent the form.
  • Files uploaded by form may need to be decoded if they are being read into tcl or saved to the database.
  • When a file of any sort - typically a template - is read, it needs to be converted from the appropriate character set to unicode.
  • The encoding that you are sending to the browser should be explicitly specified along with the mimetype in the http header and in a meta tag in the document (see the sketch after this list).
  • The bytes going to the browser need to be encoded in the appropriate charset.
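
To make the last two points concrete, a sketch (assuming $locale has already been determined for the request):

set charset [ad_locale charset $locale]
set html "<html><head><meta http-equiv=\"Content-Type\" content=\"text/html; charset=$charset\"></head><body>bla bla</body></html>"
ns_return 200 "text/html; charset=$charset" $html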

Returning an html file to a browser without any processing is a somewhat different case. I don't discuss this but it's addressed somewhere in the documents linked above.

Determining the charset

It follows that you need to determine the correct encoding early in the thread and you need to be able to retrieve it several times subsequently (whether it's the same charset for each case is something I'll address later.)

acs-lang records the internet charset for a given locale. It can be retrieved by calling:

[ad_locale charset $locale]

The temporary solution proposed for acs-lang was to call:

ns_set put [ns_conn outputheaders] "content-type" "text/html; charset=$charset"

...storing the charset for later retrieval. Sure enough, if you dig around in acs-templating, you find that a call to template::get_mime_type tries to retrieve the charset / mimetype from the outputheaders prior to returning a page. I don't believe there's anything in acs currently that stuffs the mimetype/encoding into the outputheaders. So the call to ns_return specifies a mimetype with no charset and then, depending on the content of your AOLserver config file, AOLserver will probably append a charset to the mimetype it sends to the browser (and encode the bytes accordingly).

I'm not so fond of sticking the mimetypes in the outputheaders; it confused the hell out of me for a while. One glitch that arises from using the outputheaders to store mimetypes is that ns_return automatically appends an additional content-type line to the http headers, so you wind up with two!

Suggestions

A call to [ad_locale charset $locale] will hit the database. It should probably be changed to return a cached value.
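
A sketch of one way to do that, assuming util_memoize is acceptable here (the proc names are hypothetical):

proc lc_charset_not_cached { locale } {
    # the actual database hit
    ad_locale charset $locale
}
proc lc_charset { locale } {
    # cache per locale; locale-to-charset mappings rarely change
    util_memoize [list lc_charset_not_cached $locale]
}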

If it's not expensive to retrieve your locale, the locale doesn't change in the middle of a thread (it shouldn't), and [ad_locale charset $locale] returns a cached value, then there's no reason to store the charset anywhere, IMHO; it's a derived value.

If you decide to stay with storing the charset in the output headers (it's called a hack in the acs-lang design document, but hey it works) you might want to pull the content-type line out of the outputheaders ns_set when you return the page, so you don't end up with a redundant http header line (don't give the browser an excuse to do something weird.)
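
Something like this at page-return time, assuming the charset-bearing content-type was stuffed into the outputheaders as described above ($html stands in for the page content):

set headers [ns_conn outputheaders]
set mime_type [ns_set iget $headers "content-type"]   ;# case-insensitive get
ns_set idelkey $headers "content-type"  ;# drop it so ns_return's own line is the only one
ns_return 200 $mime_type $html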

Planet

For Planet, we had somewhere else to store the charset: the gp_conn array. So I removed existing code that put charsets in the outputheaders, and set up Planet to cache charsets along with locales in gp_conn. That's a very Planet-specific approach and one that, it dawns on me now, might make it harder for Planet to re-use acs-lang code as it evolves. We'll see...

Returning the locale-specific charset to the browser

For Planet I put the call to determine the charset in a custom version of doc_return. A custom version of adp_parse_ad_conn_file calls doc_return rather than ns_return, and doesn't call template::get_mime_type. You can use the modified doc_return for non-templated pages.

There are still plenty of non-localized admin and error pages in acs that are returned with ns_return, and so down the road all this might need to be revisited. I'm not so hot on hacking ns_return when you can use a wrapper with a different name like doc_return. Maybe a matter of taste...

Finally, the procedure that returns the charset for Planet has a hack that sets the charset to utf-8 (unicode) whenever you're on an admin page. We have a non-localized gp-admin where people will be entering all sorts of characters into forms (hopefully using modern browsers that support unicode), so unicode seems like the way to go. Don't know how relevant that is to what you're doing.

Form data

The encoding of a page determines how the browser encodes form data it sends back to the server, meaning that what you sent determines what you get back. In Planet, I'm not aware of any instances where the locale currently changes across form submissions (except maybe when entering the administration pages, but that part of the site works in unicode). The straightforward solution seems to be to figure out the charset you're going to return before you retrieve data from the form, and then set the decoding of form data with a call to

ns_urlcharset $charset

in rp_filter.
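
So, roughly (a sketch; how you get hold of the locale at that point in the request processor is up to you):

# determine the charset for this request, then tell AOLserver how
# to decode the incoming query and form data
set charset [ad_locale charset $locale]
ns_urlcharset $charset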

Suggestions

I haven't really thought through where you might need to make provisions for changing locale and charset across form submissions. If you do need to accommodate this situation, one way to do it would be to make ad_form automatically include a hidden input with the character encoding in every form. Then a call to ns_formfieldcharset somewhere in the request processor can determine the encoding before accessing the form data.
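
A sketch of the two halves; the field name formcharset is hypothetical:

# when generating the form, have ad_form include a hidden field carrying the charset:
append form_html "<input type=\"hidden\" name=\"formcharset\" value=\"$charset\">"

# in the request processor, before any form data is accessed:
ns_formfieldcharset formcharset
# ns_formfieldcharset looks for that field in the incoming form and
# uses its value as the charset for decoding the rest of the form data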

Form file upload

The only text files you upload in gp-admin are .css files, which should be ASCII. So I haven't done anything with this. If you are supporting file upload, you can bet the files won't be utf-8. The problem is discussed in the Mayoff document. Good luck! :)

Localized template files

The ACS Globalization Design Document proposes using a file naming convention to specify the encoding of template files (see "Naming of Template Files To Encode Language and Character Set"). I'm still sorting this out for Planet. The likely solution is that system-wide templates will be read using the latin-1 encoding, and localized templates using whatever encoding acs-lang specifies for the locale. I think the relevant code is template::util::read_file.

Note: you need to use the tcl encoding name here, so something like this if called for:

set fd [open somefile.adp r]
fconfigure $fd -encoding [ns_encodingforcharset [ad_locale charset $locale]]
set template [read $fd]
close $fd

We'll need to work out whether the encoding of localized templates actually matches the encoding specified by acs-lang. In the end, submitting localized templates by pasting them into a form and then storing them in the database (or writing them to disk in unicode) might be a more foolproof way of doing this.

acs-lang admin pages

The gp/acs-lang admin pages should probably be smart about alerting users if they try to enter a MIME charset that AOLserver doesn't understand when setting up a locale. [ns_charsets] can be used for the check.
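
For example (a hypothetical helper; the case-folding is there because charset names are matched case-insensitively):

proc charset_known_p { charset } {
    # ns_charsets returns the list of charsets AOLserver can handle
    expr {[lsearch -exact [string tolower [ns_charsets]] \
               [string tolower $charset]] >= 0}
}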

That's all folks!

The mechanism suggested above for determining the charset to return will always choose a specific charset and never utf-8, even if the browser indicated that it can handle (and prefers) utf-8 in the Accept-Charset header. I would suggest adding a call to ns_choosecharset to the mechanism, which already implements a selection mechanism based on the preferred charset list given by the browser.

My take on Tilman's suggestion was that using ns_choosecharset to determine the charset would complicate troubleshooting character display problems down the line... but that's just a hunch.
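
For reference, Tilman's suggestion would boil down to something like this (the preference list is just an illustration):

# pick the best charset the browser says it accepts, trying the
# -preference list in order
set charset [ns_choosecharset -preference [list utf-8 shift_jis euc-jp]]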

It would be nice to know what percentage of browsers hitting Planet DO accept utf-8, as well as what other charsets are accepted for Asian locales that potentially use different charsets (for example, shift_jis and euc-jp are both used in Japan). I wonder if this information can be deduced from browser stats, or if it would be worthwhile writing a bit of code that asks browsers for charset preferences and records them (along with client IP and locale) in a table?

Has anybody written code to collect statistics on the charsets accepted by their visitors (I guess simply recording the Accept-Charset header would be sufficient)?
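
A minimal sketch of recording it (just logging here; writing to a table instead would be straightforward):

# log each visitor's Accept-Charset header (empty if the browser didn't send one)
proc log_accept_charset { why } {
    set accept [ns_set iget [ns_conn headers] "Accept-Charset"]
    ns_log Notice "Accept-Charset: '$accept' from [ns_conn peeraddr]"
    return filter_ok
}
ns_register_filter preauth GET /* log_accept_charset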

Or does anyone know of other sources on the internet that provide some statistics on this, especially on the question of how widespread utf-8 aware browsers already are?