Forum OpenACS Development: Implementing locale-specific character encodings in gp/acs-lang

I'm working on how Greenpeace Planet sends locale-specific character encodings, and was hoping for some feedback on one of the implementation details. Eventually, acs-lang will need to address the same problem, and I thought a shared solution would be better.

For the impatient: I need to make sure that every call to ns_return or the like (ns_returnnotice) supplies the correct mimetype/character encoding for a given locale, and I'm not sure what the best place to make the change is.

The gory details:

The basic problem is this: If your site only features a single character set, such as iso-8859-1 for Western European languages, it is easy enough to configure AOLserver to automatically return the right character encoding. If your site features a mix of languages that use different character sets, acs-lang / gp-lang records a character encoding for each locale and can tell you which one to use. Now you want to send the right character set to the client. What does "sending" a given character encoding to the client mean? You want

  1. the http headers that are sent to correctly specify the character encoding along with the mime type in the "Content-Type" line,
  2. ditto for the mime type / character encoding specified in the meta tag in the head portion of the html and
  3. the bytes that are sent to the browser to be correctly encoded.

Assuming you've written a procedure that will return the locale-specific charset, number two is easy - you just need your templates to call the procedure when they write the meta tag. One and three are a little more complicated. Fortunately, Rob Mayoff wrote a document that explains the messy details. Unless you want to use ns_write to specify the headers yourself, what you need to do is specify the character encoding explicitly when you call ns_return or one of its cousins like ns_returnnotice. Thus

ns_return 200 "text/html; charset=shift_jis" "bla bla"

will include the character set in the header and tell AOLserver how to encode the data.
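
A minimal sketch of how the Content-Type value could be built from a locale, assuming a procedure that returns the locale-specific charset as described above. The proc names and the locale table here are hypothetical; in practice acs-lang / gp-lang would supply the charset from its own locale records.

```tcl
# Hypothetical locale -> charset table, for illustration only.
array set example_charset_for_locale {
    en_US iso-8859-1
    de_DE iso-8859-1
    ja_JP shift_jis
    ru_RU windows-1251
}

# Return the charset for a locale, falling back to a default
# when the locale is not in the table.
proc example_charset {locale {default iso-8859-1}} {
    global example_charset_for_locale
    if {[info exists example_charset_for_locale($locale)]} {
        return $example_charset_for_locale($locale)
    }
    return $default
}

# Build the string passed as the mime type argument of ns_return.
proc example_content_type {mimetype locale} {
    return "${mimetype}; charset=[example_charset $locale]"
}
```

A page would then call something like ns_return 200 [example_content_type text/html ja_JP] $html, and the same example_charset call could feed the meta tag in a template.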

It looked like there was another option: you can access the output headers as an ns_set through [ns_conn outputheaders] at any point in the thread before you return something to the browser. The problem is that ns_return appends a mimetype to the output headers... so if you try to stuff in the mime type ("Content-Type") beforehand, you'll wind up with a second "Content-Type" line created by ns_return.

So what's the best way to include the character set? The easy solution seems to be to just include a modified version of doc_return in the gp-lang package - doc_return seems to be what the templating system calls when it returns a normal page. Then, of course, you have to make sure non-templated pages call doc_return (regrettably an issue with Planet). At any rate, this seems like a partial solution at best, since the ACS core returns lots of error pages and so on that don't call doc_return (the error pages are hardcoded in English of course). The only other thing that comes to mind is trying to hack ns_return and its cousins. Any thoughts?

Maybe I'm missing something here, but why go through the pain of rewriting HTTP headers, adding meta tags, and keeping track of a user's preferences, when you can simply use Unicode?

Yes, it's a pain to get content people to use Unicode, and old browsers don't handle it intelligently.  But I've found Unicode to be a great solution to precisely this problem, freeing me from having to worry about where people are from and what characters they want to see.

Also, the meta tag only assigns the character set if the HTTP headers fail to do so.  So if you always indicate the character set explicitly in the Content-type HTTP header, then you shouldn't need the meta tag.

Well, that's what they do at babelfish.altavista.com (use Unicode). Unicode is of course more elegant. The feedback I've gotten is that it's not a good idea yet to send Unicode on a site, like Greenpeace Planet, where wide browser compatibility is a goal.

Dumping the meta tag sounds promising!

Note that even the standard tcl/adp pair handler - adp_parse_ad_conn_file - does not call doc_return but ns_return directly, because it does its own version of releasing unused db handles. As you said, there are probably numerous places that call ns_return. If you want to change all those you'd have to touch a lot of files ...

What do you think about redefining ns_return with yet another wrapper that appends the charset to the content-type parameter if it is not yet appended (consulting ns_choosecharset and the user's language setting) - would that be too inefficient? It'd probably be the method with the least effort.
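
The "append the charset if it is not yet appended" step could be sketched roughly like this. The proc name example_add_charset is hypothetical, and the charset is passed in by the caller rather than looked up, so nothing here depends on how the user's language setting is consulted. Restricting it to text/* types is an assumption on my part, so that images and other binary content are left alone.

```tcl
# Append "; charset=..." to a content type unless one is already there.
# Non-text types are returned unchanged (an assumption of this sketch).
proc example_add_charset {content_type charset} {
    set content_type [string trim $content_type]
    if {![string match -nocase text/* $content_type]} {
        return $content_type
    }
    if {[regexp -nocase {;\s*charset\s*=} $content_type]} {
        return $content_type
    }
    return "${content_type}; charset=$charset"
}
```

The cost per request is one glob match and one regexp on a short string, so efficiency seems unlikely to be the deciding factor.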

Yeah (sheepish), I was a little confused about which proc does the return in the template system when I wrote the original post. All you have to do is tweak template::get_mime_type and you're done with the templated pages, I think.

For other pages, yes, a wrapper for ns_return sounds like a good idea to me. It looks like ns_return is written in C, not Tcl, and I don't know what you have to do to intercept it and then call it. For Planet, a simple hack to doc_return should suffice for what we need:

ad_proc -public doc_return {args} {
    Replaces acs doc_return: modified to use locale-specific charset.

    @author Alex Sokoloff (alex_sokoloff@yahoo.com)
} {
    db_release_unused_handles

    # Strip any charset parameter already present in the mime type
    # (args is: status, mime type, content).
    regsub -nocase {\s*;\s*charset=.*$} [lindex $args 1] "" mimetype

    # Add the charset specified by gp-lang.
    set mimetype "${mimetype}; charset=[gp_determine_charset]"

    set args [lreplace $args 1 1 $mimetype]

    # Build the command as a list so each argument keeps its quoting.
    eval [linsert $args 0 ns_return]
}

Again, most of the pages here and there in the system that will be returned with ns_return or one of its variants are hard-coded in English at this point, so it wouldn't really make sense to return them with a shift_jis charset even on a Japanese portion of a site. So maybe there isn't much of a problem here after all!

To wrap ns_return first rename it to something else (rename ns_return _real_ns_return), then define your own proc that takes the same # of arguments as the original ns_return. Within that proc when you're ready to return data to the user, call the _real_ns_return.
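
A rough sketch of that rename approach. The stand-in ns_return at the top exists only so the example also runs in a plain tclsh outside AOLserver, and example_current_charset is a hypothetical placeholder for whatever looks up the charset for the current user's locale.

```tcl
# Outside AOLserver there is no ns_return, so define a stand-in that
# just records its arguments. Inside AOLserver this block is skipped.
if {[llength [info commands ns_return]] == 0} {
    proc ns_return {status type content} {
        set ::example_last_return [list $status $type $content]
    }
}

# Hypothetical charset lookup; a real one would consult the locale.
proc example_current_charset {} {
    return iso-8859-1
}

# Move the original command out of the way, but only once.
if {[llength [info commands _real_ns_return]] == 0} {
    rename ns_return _real_ns_return
}

# Replacement taking the same three arguments as the original.
proc ns_return {status type content} {
    # Add a charset only to text types that do not already carry one.
    if {[string match -nocase text/* $type]
        && ![regexp -nocase {;\s*charset\s*=} $type]} {
        append type "; charset=[example_current_charset]"
    }
    _real_ns_return $status $type $content
}
```

As far as I can tell, this only catches callers that go through the Tcl command, so cousins like ns_returnnotice would presumably need the same treatment individually.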

After you're done recompiling and testing aolserver, of course... Whacking the source of aolserver seems like a rather extreme and inconvenient solution for this problem...

Uh, no recompilation or C-hacking required... just use the Tcl rename command.

"So if you always indicate the character set explicitly in the Content-type HTTP header, then you shouldn't need the meta tag."

In theory, yes. In practice, there are situations when various browsers will behave inconsistently and plain weird unless you list the charset in both places. Also, it's nice to keep the charset information in an HTML page saved by the user (hence the meta tag).