Forum OpenACS Q&A: Dealing with non-Roman character sets

1: Dealing with non-Roman character sets

Posted by Daryl Biberdorf on 01/03/02 05:11 PM

I'm working on building a site for theological discussion, using OpenACS. Real theological discussion :) has need of words from languages that don't use Roman characters (e.g., Greek and Russian). As I hate transliteration (representing the characters from one language with Roman equivalents), I'm wondering how one would deal with mixed-language pages. That is, most of the page would be in English (or another Western language), while individual words and phrases might use Greek or Cyrillic characters. Has anyone done this?

2: Response to Dealing with non-Roman character sets (response to 1)

Posted by Henry Minsky on 01/03/02 05:37 PM

You need Unicode, my man!

Unicode has a character code for every written human language,
for the most part.

Now, getting the correct font and displaying correctly on your computer is another issue. Microsoft browsers can generally display multilingual data in Unicode. The most common encoding used is UTF-8, a variable-length code which is quite compact for ASCII and Western languages.
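
To make the "variable-length" point concrete, here is a minimal Python sketch (Python is not part of this thread, and the sample characters are chosen purely for illustration). It counts how many bytes UTF-8 needs for a single character from several scripts.

```python
# Sample characters from different scripts (illustrative choices only).
samples = {
    "Latin 'a'": "a",
    "Greek 'π'": "π",
    "Cyrillic 'Ж'": "Ж",
    "Japanese '日'": "日",
}


def utf8_length(ch: str) -> int:
    """Return the number of bytes UTF-8 uses to encode the given text."""
    return len(ch.encode("utf-8"))


lengths = {label: utf8_length(ch) for label, ch in samples.items()}
for label, n in lengths.items():
    print(f"{label}: {n} byte(s)")
```

ASCII stays at one byte per character, Greek and Cyrillic letters take two, and most CJK characters take three, which is why UTF-8 is compact for mostly-English pages with a few foreign words.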

Check out http://www.geocities.com/i18nguy/unicode-example.html for
an example.

Getting ACS to properly encode data in Unicode requires a little bit
of setup. Are you using OpenACS 3 or OpenACS 4?

I just started running OpenACS 4 a few weeks ago. I had some patches for ArsDigita's ACS 4 to use Japanese and so forth by using Unicode internally in the database, and providing hooks to AOLserver to tell it what character set encoding to accept and emit (if you are letting people enter text via forms, you may need to know what encoding the text is in, which is somewhat f*cked because of lack of standards amongst browsers).

Anyway, I have some patches for ACS 3.2.5 that will let you do Unicode, and I will try to work up an official set of changes that could go into OpenACS 4. That will require a little bit of effort, but most of the groundwork is there. AOLserver and Postgres both handle Unicode (and so does Oracle); it's just a few lines of code here and there in ACS to set the encodings when pages are generated or form variables are read from browsers.

3: Response to Dealing with non-Roman character sets (response to 1)

Posted by Andrew Grumet on 01/03/02 05:45 PM

utf-8 allows you to display symbols from multiple languages on a single page, provided that your users have the necessary fonts installed. The later ArsDigita AOLserver distros have nice support for multiple character sets, including utf-8. There is reasonable documentation here: http://www.arsdigita.com/asj/multilingual/.

Working with utf-8 requires careful attention to webserver, database, environment variable and (when debugging) terminal settings. I have had success on client sites with Oracle; haven't tried with PG.

4: Response to Dealing with non-Roman character sets (response to 1)

Posted by Don Baccus on 01/03/02 08:18 PM

Henry - glad to see you posting here again! It's been too long ...

Now that I've rewritten the hierarchical query/tree sortkey stuff to use BIT VARYING rather than text, not only should UTF-8 work but you should be able to set your locale and use the proper collation sequence for your language of choice. If you're using PG just remember that you need to do this before you do your post-install INITDB.

Hmmm...this works great for, say, an all-Japanese site but not so well for a mixed-language site. The PG folks have been talking about possibly allowing a little more flexibility in the future.

Anyway ... Henry, if you do work up a set of patches for OpenACS 4 and can test them for both Oracle and PG people would really appreciate it. If you could do so in the next couple of weeks we'd appreciate it even more because that's my rough notion as to when a beta release might be cut. A very rough notion as I've not discussed the documentation schedule with Roberto, yet.

5: Response to Dealing with non-Roman character sets (response to 1)

Posted by Jade Rubick on 01/03/02 08:26 PM

Is there any reason that we shouldn't just set up the default installation to be Unicode-based? I'm wondering if this should just be the default way to set up OpenACS (and Postgres and AOLserver).

I'm speaking from ignorance here... Just curious.

6: Response to Dealing with non-Roman character sets (response to 1)

Posted by Don Baccus on 01/03/02 09:57 PM

It kills "LIKE" optimization in PG. While that's not a big deal for OpenACS 4 per se, it might be for custom code based on the toolkit.

I'm not sure what the distros are doing regarding PG RPMs (or other packages) shipped and installed by default. I've not tried to follow this. We might want to follow their lead. But I think we'd want Henry's patches first (if "their lead" means "use UTF-8"?)

Tcl 8 is always UTF-8 - that's what broke ns_uuencode, for instance.

7: Response to Dealing with non-Roman character sets (response to 1)

Posted by Neophytos Demetriou on 01/04/02 02:41 AM

I have been using openacs-4 with aolserver-3.3.1+ad13 (which AFAIK has internationalization support based on Mayoff and Minsky's patches) for about six months now and it works great. In my case AOLserver is configured to convert content into utf-8 as soon as possible (when data is transferred from the client) and keep it in utf-8 in the database for as long as possible (until data is transferred to the client). Maintaining the data in utf-8 on the server side and transferring it back to the user is the easy part. The difficult part, however, is that you need to know the encoding of the text submitted through a form (as Henry has already pointed out above). If you were using an ISO-8859 character set for transferring content to the client then I would not expect any serious problems (at least this is the case for ISO-8859-7, i.e. Greek). This is based on the fact that the user is more likely to submit her text in the same ISO-8859 character set as the one used when you transferred the page to her browser. In that case AOLserver will convert the data back to utf-8 as *required*. In your case though, you have to transfer the page to the client in utf-8 since you want truly multilingual documents (instead of bilingual, which is what you get with ISO-8859 character sets). Let me just say that I'm not an expert on this stuff and I would appreciate it if Henry or anybody else could verify this information.
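
The form-encoding problem described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name `to_utf8` is hypothetical, the sample word is arbitrary, and it is not how AOLserver itself performs the conversion. It shows that converting submitted bytes to utf-8 works when the server knows the charset the page was served in, and silently produces garbage when it guesses wrong.

```python
def to_utf8(raw: bytes, page_charset: str) -> bytes:
    """Decode submitted form bytes using the charset the page was served in,
    then re-encode them as UTF-8 for storage (hypothetical helper)."""
    return raw.decode(page_charset).encode("utf-8")


# Assumption: a browser shown an ISO-8859-7 page submits Greek text
# in that same charset.
submitted = "λόγος".encode("iso-8859-7")

# Correct charset: the text survives the round trip.
stored = to_utf8(submitted, "iso-8859-7")
print(stored.decode("utf-8"))

# Wrong guess (ISO-8859-1): no error is raised, but the result is mojibake.
wrong = submitted.decode("iso-8859-1")
print(wrong == "λόγος")
```

The wrong guess does not fail loudly, which is what makes an unknown form encoding hard to detect after the fact.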

Also, keep in mind that openfts-tcl *cannot* be used with utf-8 documents as is. I got it working with utf-8 documents by modifying the parser and by only using the default dictionary (UnknownDict -- no stemming, no stopwords, exact matching). Making openfts handle utf-8 documents using more than one dictionary, i.e. using Porter's algorithm for English and a morphology-based dictionary for some other language, is a more complicated process, and even though I have an idea of how it can be done I have not had the chance to try it yet.

8: Response to Dealing with non-Roman character sets (response to 1)

Posted by Bruno Mattarollo on 01/07/02 08:30 PM

This is a very interesting thread! :)

We, at Greenpeace, are working on an ACS classic 4.2 site. So we are using Oracle as the RDBMS (this solves most of the problems when dealing with utf-8 and multiple languages, since interMedia -in theory- should be able to index multilingual content and you should be able to perform complex searches based on languages -if supported by Oracle-). We are using Henry's and Rob's patches to AOLserver and ACS and they work mostly fine.

As already pointed out, you should take care in the part where you send information to a client (browser) or receive it from one.

The question you are asking, if I understand correctly, is "how do I display multiple languages in one page?". I would say that the only way to do that is to send and receive in utf-8, but in that case the browser needs to support it and the user needs to have the correct fonts installed. But do you really see pages having discussions in different languages/alphabets in the same page?

I am not 100% sure that you can't display multiple charsets in one page, but if you are sending back to the browsers one charset in the HTTP headers, then you might run into trouble (Henry, do you know what would happen in this case?)
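
A small Python sketch of the point about HTTP headers, under stated assumptions: `build_response` is a hypothetical helper, the sample text is made up, and this is not AOLserver code. A response declares a single charset in its Content-Type header, so the whole body has to be encodable in that one charset. utf-8 can carry Greek and Cyrillic together; a single ISO-8859 charset cannot.

```python
def build_response(body_text: str, charset: str = "utf-8") -> bytes:
    """Build a minimal HTTP response whose body is encoded in one charset
    (hypothetical helper for illustration)."""
    body = body_text.encode(charset)  # raises if the charset lacks a character
    headers = (
        "HTTP/1.0 200 OK\r\n"
        f"Content-Type: text/html; charset={charset}\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    )
    return headers.encode("ascii") + body


# Mixed English, Greek and Cyrillic in one page (sample text only).
page = "<p>Tradition: παράδοσις / предание</p>"

# UTF-8 covers every character on the page.
response = build_response(page)
print(response.split(b"\r\n")[1].decode("ascii"))

# ISO-8859-7 has the Greek letters but not the Cyrillic ones.
try:
    build_response(page, "iso-8859-7")
    greek_only_ok = True
except UnicodeEncodeError:
    greek_only_ok = False
print("ISO-8859-7 can carry the page:", greek_only_ok)
```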

Just my 2 cents.

9: Response to Dealing with non-Roman character sets (response to 1)

Posted by Daryl Biberdorf on 01/07/02 08:59 PM

Bruno, you are correct that I want the ability to mix languages within a page. It's fairly normal in theological discourse (since the advent of word processing, anyway), to freely intersperse original words into English (or German or...) discourse. When I want to talk about Tradition (in a Christian sense, for example), I'd rather refer to ????????? than paradosis.

I just did this using Unicode, so we'll see. It doesn't work worth a flip in Netscape 4.7x. This is Netscape 6.2...

10: Response to Dealing with non-Roman character sets (response to 1)

Posted by Daryl Biberdorf on 01/07/02 08:59 PM

Well, clearly something more is needed.

11: Response to Dealing with non-Roman character sets (response to 1)

Posted by Henry Minsky on 01/07/02 09:59 PM

You can mix as many languages as you like using Unicode - it doesn't care, it has all written languages in the same "code space". The problem is fonts. You need a single font which includes all the glyphs in all the languages. There are some ugly Unicode fonts that have a lot of languages in them, but you'd have to be pretty desperate to use them. I think Internet Explorer has a pretty basic Unicode font in it by default; at least if you try some of the Unicode test pages on the net, they display in many languages at the same time.

Beyond that you'd need an application that could specify fonts for each run of text in each language.
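
As a rough illustration of splitting text into "runs" per language, here is a Python sketch. It is an approximation under stated assumptions: the script label is simply the first word of each character's Unicode name (a heuristic, not a real script database), and the function names are hypothetical. An application could then pick a font for each run.

```python
import unicodedata


def script_of(ch: str) -> str:
    """Rough script label: first word of the Unicode character name.
    Non-letters (spaces, punctuation) are labelled COMMON."""
    if not ch.isalpha():
        return "COMMON"
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def script_runs(text: str) -> list[tuple[str, str]]:
    """Split text into (script, run) pairs; COMMON characters stay
    attached to the run that precedes them."""
    runs: list[tuple[str, str]] = []
    for ch in text:
        script = script_of(ch)
        if runs and (script == "COMMON" or script == runs[-1][0]):
            runs[-1] = (runs[-1][0], runs[-1][1] + ch)
        else:
            runs.append((script, ch))
    return runs


for script, run in script_runs("word λογος слово"):
    print(f"{script}: {run!r}")
```

All three scripts live in one string because they share one code space; only the choice of font differs per run.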

A friend of mine has been asking to get Japanese working again in OpenACS 4, so I'll try to get that done in the next week (commit commit)

12: Response to Dealing with non-Roman character sets (response to 1)

Posted by Don Baccus on 01/07/02 10:13 PM

Daryl, it won't work here because this site's not been set up to deal with UTF-8. We may even be running under Tcl 7.6, which doesn't support it :)
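
One plausible way the Greek word in Daryl's post ended up as question marks can be sketched in Python. This is an assumption for illustration only: the thread does not say where in this site's pipeline the characters were lost, and `store_in_legacy_charset` is a hypothetical helper. When text passes through a single-byte charset that lacks the characters, a lossy conversion substitutes "?" for each one.

```python
def store_in_legacy_charset(text: str, charset: str = "iso-8859-1") -> str:
    """Simulate a pipeline that can only hold one single-byte charset.
    Characters the charset cannot represent are replaced with '?'."""
    return text.encode(charset, errors="replace").decode(charset)


# Latin text passes through untouched; each Greek letter becomes '?'.
mangled = store_in_legacy_charset("logos / λογος")
print(mangled)
```

Once the substitution has happened the original characters cannot be recovered, so the fix has to be made in the site's configuration rather than in the stored data.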

13: Response to Dealing with non-Roman character sets (response to 1)

Posted by Daryl Biberdorf on 01/07/02 10:18 PM

Thanks for the heads-up, Don. Clearly I need to do some personal research. :)

14: Response to Dealing with non-Roman character sets (response to 1)

Posted by Don Baccus on 01/07/02 10:57 PM

Henry sez: (commit commit)

Don sez: if you (commit, commit) ... I'll give you CVS commit privs!