Forum OpenACS Q&A: Multilingual sites -- any catches for Japanese or Chinese?

I've worked on multilingual sites in the past, with Hebrew, English, and Arabic.  Now I've been asked by a potential client to work on a site that includes Western European languages *plus* Chinese and Japanese.  These languages don't have to be on the same page, but I expect that we'll want the database and applications to be unified.

I don't have any experience with Chinese or Japanese, in Unicode or otherwise.  Can anyone tell me if there are any special considerations that I should keep in mind?  Or should OpenACS work as seamlessly in these languages as it does in Hebrew and Arabic?

(Note that I'm not talking about localization of the interface into these languages, but rather the storage and retrieval of data in various modules, such as news, forums, and bug-tracker.)

Reuven

Henry Minsky has done Japanese sites, and at one point there was a small but active group of Japanese users of ACS 3.x (Henry lived in Japan for a year or two, IIRC).

This led to the charset patches in AOLserver3.3+ad13.  These have been adapted to AOLserver 4 (still in beta) but I don't know if anyone's done significant testing in Japanese or Chinese yet.

PostgreSQL works fine with Japanese - there are Japanese developers involved in the project.  Oracle, of course, should have no problem with any language charset either.

I know Greenpeace did testing with a variety of languages but I don't think they've actually rolled a live site with Japanese yet.

I did have a customer (dotcom that went under) that used AOLserver 3.3ad13 with Henry Minsky's patches with OpenACS 3.2.5.

From memory, here is what you need to know:

1.  The patches may reference one or two files that do not exist - these can be safely ignored.

2.  The patches may not cover a one- or two-line change that is needed in ns_sendmail or in the ecommerce mailing code.

3.  The big problem with configuring for Japanese when you can't read Japanese is that you will not be able to tell when the page is correct.

Wrongly configured servers will still set the charset such that you will see what looks like Japanese in the browser, but it will be immediately apparent as gibberish to someone who can read Japanese.

A possible fix is to have the customer make a sample html page in Japanese; load it on their browser and print it out (or take a screenshot) so you can visually determine correctness.

4.  I did not have any problems with the Postgres side of things.  I believe I did use the encoding of either Unicode or jp-2022 (I think) when creating the database.  You can do this by exporting the variable PG_ENCODING  or passing the proper flags to Postgres when you create the database.  I seem to recall setting PG_ENCODING as an environment variable when starting the AOLserver process as well.
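One way to automate the check in point 3, as a rough sketch only: get the expected text of a sample page from the customer, then compare it with what the server actually sends, decoded with the charset the server declares. The function name and the sample string are made up for illustration, and Python is used purely for convenience; none of this is part of AOLserver or OpenACS.

```python
def page_matches_reference(served: bytes, declared_charset: str, reference: str) -> bool:
    """Decode the served bytes with the declared charset and compare the
    result with text that a Japanese reader has confirmed is correct."""
    try:
        decoded = served.decode(declared_charset)
    except (UnicodeDecodeError, LookupError):
        # Undecodable bytes or an unknown charset name: definitely wrong.
        return False
    # A wrong charset can also decode without error and still be gibberish,
    # so the decoded text must match the reference exactly.
    return decoded == reference


reference = "日本語のページ"  # hypothetical sample text supplied by the customer

# Bytes and declared charset agree.
print(page_matches_reference(reference.encode("euc_jp"), "euc_jp", reference))

# EUC-JP bytes served with a UTF-8 label.
print(page_matches_reference(reference.encode("euc_jp"), "utf-8", reference))
```

This does not replace having someone who reads Japanese look at the real pages, but it catches the case where the bytes and the declared charset disagree.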

Hope this helps.  If you run into other issues, I think I have a tarball of the relevant pieces. Just let me know what you need.

Reuven

We have almost finished translating to Korean (check: http://translate.dotlrn.collaboraid.net/) and I will have a student working in Arabic beginning in July.

I am not aware of any problems with the Korean version, and Japanese should be pretty similar.

I just noticed your last paragraph... how is that different from the localization of the interface?

Thanks for all of the advice so far.

To respond to Rafael, I am indeed *uninterested* in the localization of the user interface, at least at this point.  I mean, it would be a nice bonus to enjoy, but my main concern is whether we will be able to store and retrieve information in Unicode without any trouble.

For example: We developed a site in OpenACS 4.5 that used Hebrew, Arabic, and English.  Everything worked swimmingly well, except for the issue of text alignment -- where Arabic and Hebrew need to be right-aligned, and English needs to be left-aligned.  And there are little BIDI problems like where the period goes at the end of a sentence, if the period is the final character in the sentence.  But aside from those niggling little issues, we didn't have to think about the different character sets once we modified nsd.tcl.
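For anyone facing the same alignment question, here is a minimal sketch of one way to pick left or right alignment per text block: look for the first strongly directional character, which is a simplified version of the "first strong" heuristic in the Unicode Bidirectional Algorithm. The function name is hypothetical, and this is not how our site or OpenACS actually did it.

```python
import unicodedata


def base_direction(text: str) -> str:
    """Return "rtl" or "ltr" based on the first strongly directional character."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ("R", "AL"):  # Hebrew and Arabic letters
            return "rtl"
        if bidi_class == "L":          # Latin, CJK, and most other scripts
            return "ltr"
    # Only digits, spaces, or punctuation: fall back to left-to-right.
    return "ltr"


for sample in ("Hello", "שלום", "مرحبا", "日本語"):
    print(sample, base_direction(sample))
```

The result could drive something like an HTML dir attribute or a CSS text-align value. Chinese and Japanese come out as left-to-right, so they should not add alignment work beyond what English already needs.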

I just wanted to double-check that this would be true for Japanese and Chinese if we use Unicode, and that there are no unpleasant surprises awaiting us.  (Patrick's point about making sure that the Japanese doesn't just look OK, but is actually OK, is a very important and useful one.)

In theory, Unicode means that we can completely ignore the language in which things are stored -- that OpenACS will work with Chinese, Japanese, Russian, French, Hebrew, Arabic, Korean, and even cuneiform without modifications.  (Although cuneiform tablets don't always come with USB connectors.)

I just want to make sure that we're not going to discover all sorts of problems and issues down the road.
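A quick round-trip test is one way to look for surprises early. The sketch below uses Python's built-in SQLite module as a stand-in for the real database, which is an assumption made only to keep the example self-contained; the table name and sample strings are invented. The same idea (insert text in every language, read it back, compare) would apply to PostgreSQL or Oracle behind OpenACS.

```python
import sqlite3

# Hypothetical sample strings, one per script we care about.
samples = {
    "French": "Où est la bibliothèque ?",
    "Hebrew": "שלום",
    "Arabic": "مرحبا",
    "Russian": "Привет",
    "Chinese": "你好",
    "Japanese": "こんにちは",
    "Korean": "안녕하세요",
    "Cuneiform": "\U00012000",  # outside the Basic Multilingual Plane
}


def round_trip(texts: dict) -> dict:
    """Store every string in an in-memory table and read it back."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE messages (lang TEXT PRIMARY KEY, body TEXT)")
    conn.executemany("INSERT INTO messages VALUES (?, ?)", texts.items())
    rows = conn.execute("SELECT lang, body FROM messages").fetchall()
    conn.close()
    return dict(rows)


stored = round_trip(samples)
for lang, body in samples.items():
    print(lang, "ok" if stored[lang] == body else "MISMATCH")
```

A test like this only proves the storage layer; the web server's charset settings and the browser still have to agree on the encoding of each page.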

Posted by MK Tam on
We have 5+ sites running on AOLserver 3.3+ad13 in Big5 Chinese (including Greenpeace China's website).  There is no serious problem if the parameters in nsd.tcl are set correctly, but because AOLserver handles text internally as Unicode, characters from at least two extended code pages (the Hong Kong Supplementary Character Set and the ETen character set) cannot be converted and come out as ??.

I don't know how the internal conversion is done, or whether the code table can be upgraded to include those two code pages.
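To find out ahead of time which characters would be lost, one option is to scan the site's text for characters the target encoding cannot represent. This is only a sketch: the function name and sample text are made up, and it uses Python's "big5" and "big5hkscs" codecs, whose tables are not necessarily the same as the ones AOLserver uses.

```python
def find_unencodable(text: str, codec: str) -> list:
    """Return the characters in text that the given codec cannot encode."""
    lost = []
    for ch in text:
        try:
            ch.encode(codec)
        except UnicodeEncodeError:
            lost.append(ch)
    return lost


sample = "你好"  # hypothetical content; substitute real text from the site

# Compare plain Big5, Big5 with the Hong Kong supplement, and UTF-8.
for codec in ("big5", "big5hkscs", "utf-8"):
    print(codec, find_unencodable(sample, codec))

# When a converter substitutes instead of failing, each lost character
# becomes a question mark, which is the "??" symptom described above.
print("日本".encode("ascii", errors="replace"))  # b'??'
```

Running the scan against real content with each candidate codec shows how much text depends on the extended code pages before a charset is chosen.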