Forum OpenACS Q&A: Response to Globalization

7: Response to Globalization (response to 1)
Posted by Tom Jackson on

You also need a useful discussion of configuration for multiple languages. Several threads have already touched on this. The questions are:

  • How to get your database to handle multiple languages?
  • How to configure AOLserver to handle multiple languages?
  • How to set up a testing environment to find bugs?

I have started to work on these issues because I have a site that uses Italian, and another that wishes to use Farsi, running from the same AOLserver instance.

In my case the UTF-8 encoding of Unicode seems the only possible solution. I didn't have any trouble configuring PostgreSQL to handle Unicode; I followed the advice in other threads for the configuration. AOLserver was also easily configured, I think. The main issue so far is setting up a testing environment to verify that these two big chunks of software are actually working.
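As a first sanity check, independent of PostgreSQL and AOLserver, it can help to see what correct and incorrect handling of UTF-8 looks like on a couple of known strings. The following is only a sketch in Python, not part of the AOLserver setup: the sample words and the function names (`byte_length`, `round_trips`, `misread_as_latin1`) are assumptions made up for illustration. It shows that Italian and Farsi text takes more bytes than characters, and what the same bytes turn into if some layer wrongly assumes ISO-8859-1.

```python
# Hypothetical sample words; any Italian and Farsi text would do.
SAMPLES = {
    "italian": "perch\u00e9",              # "perche" with an acute accent on the final e
    "farsi": "\u0633\u0644\u0627\u0645",   # "salam" written in Arabic script
}


def byte_length(text: str) -> int:
    """Number of bytes the text occupies once encoded as UTF-8."""
    return len(text.encode("utf-8"))


def round_trips(text: str) -> bool:
    """True if the text survives a UTF-8 encode/decode cycle unchanged."""
    return text.encode("utf-8").decode("utf-8") == text


def misread_as_latin1(text: str) -> str:
    """What the UTF-8 bytes look like if a layer wrongly assumes ISO-8859-1."""
    return text.encode("utf-8").decode("latin-1")


if __name__ == "__main__":
    for name, text in SAMPLES.items():
        print(name, len(text), "characters,", byte_length(text), "bytes,",
              "round trip ok:", round_trips(text))
    # The accented letter shows up as two unrelated characters.
    print(repr(misread_as_latin1(SAMPLES["italian"])))
```

If a page stored in the database and served back shows the two-character pattern from `misread_as_latin1` where one accented letter should be, some component between the database and the browser is probably not treating the data as UTF-8.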

To help, I have started to collect a few test pages. These are currently grouped at http://zmbh.com/utf-8/.

Here is a description of a few of these:

  • http://zmbh.com/utf-8/UTF-8-test.txt tests a number of things related to the UTF-8 parser. You can use it to check whether your display's parser has any bugs. So far I haven't found a web browser that doesn't have a few. That doesn't necessarily mean you cannot view correct UTF-8 characters, only that incorrect characters are not handled correctly, which could lead to security problems. The only display I have working is an xterm in -u8 mode. You probably need something similar to RedHat 7.0+; then you need to install the 10646-1 fonts and start xterm with a command similar to:
    LC_CTYPE=en_US.UTF-8 xterm \
     -fn '-Misc-Fixed-Medium-R-Normal--15-140-74-75-C-90-ISO10646-1'
    
    The UTF-8 test file is built so that, if your parser is working, each line displays as 79 characters plus a newline. The 79th character is '|', so you get a vertical line down the right side that should line up.
  • http://zmbh.com/utf-8/utf8.html has many languages on it and can be used to check whether the font you are using supports the language you want. Not every font supports characters from every language. A correct parser will replace characters it has no glyph for with its replacement character. Sometimes this is a question mark that looks a little weird, or an upside-down question mark, or in the case of xterm, a dotted outline of the boundary of the character. It is not a bug in your parser to show these replacement characters. It is a bug if other strange, obviously non-language characters show up: that means your parser did not correctly find the multibyte character.

    It is still possible that something along the way mangled the file. What seems to work for me is to download the file with wget and cat it in the correctly configured xterm.

  • http://zmbh.com/utf-8/fconfigure.tcl opens the utf8.html file, configures the channel, reads the data into a string, and then uses ns_return to return the string. I fetched this page with wget as well and then used diff to figure out why the lengths were different. Everything was the same except the Vietnamese line. Maybe there is a bug in ns_return?
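The wget-and-diff comparison in the last item can also be scripted. The following is a minimal sketch in Python, not the Tcl page itself: it compares the original file's bytes with the bytes the server returned and lists the lines that differ, which is one way to narrow a length mismatch down to a single line. The function name `differing_lines` and the sample data are hypothetical, and the simulated damage (UTF-8 bytes misread as ISO-8859-1 and encoded again) is just one plausible kind of corruption, not a diagnosis of ns_return.

```python
def differing_lines(original: bytes, served: bytes):
    """Return (line_number, original_line, served_line) for each differing line.

    Lines are compared as raw bytes so that encoding damage is not hidden
    by a decoding step. Line numbers start at 1; a side that has run out
    of lines is reported as None.
    """
    orig_lines = original.split(b"\n")
    served_lines = served.split(b"\n")
    diffs = []
    for i in range(max(len(orig_lines), len(served_lines))):
        a = orig_lines[i] if i < len(orig_lines) else None
        b = served_lines[i] if i < len(served_lines) else None
        if a != b:
            diffs.append((i + 1, a, b))
    return diffs


if __name__ == "__main__":
    # Hypothetical three-line file; line 2 is Vietnamese text.
    original = "English\nTi\u1ebfng Vi\u1ec7t\nItaliano\n".encode("utf-8")

    # Simulate damage to line 2 only: misread its bytes as ISO-8859-1
    # and encode the result as UTF-8 again.
    lines = original.split(b"\n")
    lines[1] = lines[1].decode("latin-1").encode("utf-8")
    served = b"\n".join(lines)

    print("length difference:", len(served) - len(original))
    for number, a, b in differing_lines(original, served):
        print("line", number, "differs:", a, "vs", b)
```

In practice the two byte strings would come from the file on disk and from the copy saved by wget, both opened in binary mode.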