Forum OpenACS Q&A: Dealing with non-Roman character sets
I'm working on building a site for theological discussion, using OpenACS. Real theological discussion :) has need of words from languages that don't use Roman characters (e.g., Greek and Russian). As I hate transliteration (representing the characters from one language with Roman equivalents), I'm wondering how one would deal with mixed-language pages. That is, most of the page would be in English (or another Western language), while individual words and phrases might use Greek or Cyrillic characters. Has anyone done this?
Unicode has a character code for every written human language,
for the most part.
Now, getting the correct font and displaying correctly on your computer is another issue. Microsoft browsers can generally display multilingual data in Unicode. The most common encoding used
is UTF-8, a variable-length code which is quite compact for ASCII
and western languages.
Check out http://www.geocities.com/i18nguy/unicode-example.html for an example of a multilingual Unicode page.
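To see how compact UTF-8 stays for ASCII while still covering other scripts, here's a quick sketch in Python (the byte counts are the point, not the language):

```python
# UTF-8 is variable-length: ASCII stays 1 byte per character,
# while Greek and Cyrillic take 2 bytes each and many CJK
# characters take 3, so mostly-Western text stays small.
for text in ["abc", "αβγ", "абв", "日本語"]:
    encoded = text.encode("utf-8")
    print(text, len(text), "chars ->", len(encoded), "bytes")
```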
Getting ACS to properly encode data in Unicode requires a little bit
of setup. Are you using OpenACS 3 or OpenACS 4?
I just started running OpenACS 4 a few weeks ago. I had some patches for
ArsDigita's ACS 4 to support Japanese and so forth by using Unicode internally in the database, and providing hooks to AOLserver to
tell it what character set encoding to accept and emit (if you are
letting people enter text via forms, you may need to know what
encoding the text is in, which is somewhat f*cked because of the
lack of standards amongst browsers).
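To illustrate why the browser's encoding matters (a sketch of the general problem, not of ACS code): the same bytes decode to different text depending on which charset the server assumes.

```python
# Form data arrives as raw bytes; without a reliable charset
# indication from the browser, the server has to guess how to
# decode them, and a wrong guess produces mojibake.
raw = "café".encode("utf-8")        # what a UTF-8 browser would send

print(raw.decode("utf-8"))          # café   (correct guess)
print(raw.decode("iso-8859-1"))     # cafÃ©  (wrong guess)
```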
Anyway, I have some patches for ACS 3.2.5 that will let you do
Unicode, and I will try to work up an official set of changes that
could go into OpenACS 4. That will require a little bit of effort, but
most of the groundwork is there: AOLserver and Postgres both handle Unicode (and so does Oracle); it's just a few lines of code here
and there in ACS to set the encodings when pages are generated or
form variables are read from browsers.
Working with utf-8 requires careful attention to webserver, database, environment variable and (when debugging) terminal settings. I have had success on client sites with Oracle; haven't tried with PG.
Now that I've rewritten the hierarchical query/tree sortkey stuff to use BIT VARYING rather than text, not only should UTF-8 work but you should be able to set your locale and use the proper collation sequence for your language of choice. If you're using PG just remember that you need to do this before you do your post-install INITDB.
Hmmm...this works great for, say, an all-Japanese site but not so well for a mixed-language site. The PG folks have been talking about possibly allowing a little more flexibility in the future.
Anyway ... Henry, if you do work up a set of patches for OpenACS 4 and can test them for both Oracle and PG, people would really appreciate it. If you could do so in the next couple of weeks we'd appreciate it even more, because that's my rough notion as to when a beta release might be cut. A very rough notion, as I've not discussed the documentation schedule with Roberto yet.
I'm speaking from ignorance here... Just curious.
I'm not sure what the distros are doing regarding PG RPMs (or other packages) shipped and installed by default; I've not tried to follow this. We might want to follow their lead (assuming "their lead" means "use UTF-8"), but I think we'd want Henry's patches first.
Tcl 8 is always UTF-8 - that's what broke ns_uuencode, for instance.
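Presumably the breakage is the classic one: once Tcl 8 stores strings as UTF-8 internally, a C command that grabs a string's bytes no longer sees the raw 8-bit binary it expects. A rough Python sketch of the effect (illustrative only, not AOLserver's API):

```python
import base64

# A string with one non-ASCII character. Encoding the raw 8-bit
# (Latin-1) bytes and encoding the UTF-8 bytes give different
# results, so a command that silently switched to UTF-8 input
# starts producing different output for the same string.
s = "café"
print(base64.b64encode(s.encode("iso-8859-1")))  # b'Y2Fm6Q=='
print(base64.b64encode(s.encode("utf-8")))       # b'Y2Fmw6k='
```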
Also, keep in mind that openfts-tcl *cannot* be used with utf-8 documents as is. I got it working with utf-8 documents by modifying the parser and by only using the default dictionary (UnknownDict -- no stemming, no stopwords, exact matching). Making openfts handle utf-8 documents using more than one dictionary, i.e. using Porter's algorithm for English and a morphology-based dictionary for some other language, is a more complicated process, and even though I have an idea of how it can be done, I have not had the chance to try it yet.
This is a very interesting thread! :)
We, at Greenpeace, are working on an ACS Classic 4.2 site, so we are using Oracle as the RDBMS. This solves most of the problems when dealing with utf-8 and multiple languages, since interMedia (in theory) should be able to index multilingual content, and you should be able to perform complex searches based on language (if supported by Oracle). We are using Henry's and Rob's patches to AOLserver and ACS, and they work mostly fine.
As pointed out already, you should take care in the parts where you send information to a client (browser) or receive it from one.
The question you are asking, if I understand correctly, is "how do I display multiple languages in one page?" I would say that the only way to do that is to send and receive in utf-8, but in that case the browser needs to support it and the user needs to have the correct fonts installed. But do you really see pages having discussions in different languages/alphabets on the same page?
I am not 100% sure that you can't display multiple charsets in one page, but if you are sending back to the browser one charset in the HTTP headers, then you might run into trouble. (Henry, do you know what would happen in this case?)
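For what it's worth, a single UTF-8 byte stream can carry any mix of scripts, so one charset in the HTTP headers should be enough; a quick sketch of that assumption:

```python
# One UTF-8 byte stream holding English, Greek, and Cyrillic at once.
# Declaring "Content-Type: text/html; charset=utf-8" covers all of it;
# there is no need for a separate charset per script.
mixed = "tradition / παράδοσις / предание"
wire = mixed.encode("utf-8")
assert wire.decode("utf-8") == mixed   # round-trips losslessly
```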
Just my 2 cents.
Bruno, you are correct that I want the ability to mix languages within a page. It's fairly normal in theological discourse (since the advent of word processing, anyway) to freely intersperse original words into English (or German or...) discourse. When I want to talk about Tradition (in a Christian sense, for example), I'd rather refer to παράδοσις than paradosis.
I just did this using Unicode, so we'll see. It doesn't work worth a flip in Netscape 4.7x. This is Netscape 6.2, which seems to have a
basic Unicode font in it by default; at least if you try some of the Unicode test pages on the net, they display in many languages at the same time.
Beyond that you'd need an application that could specify fonts for each run of text in each language.
A friend of mine has been asking me to get Japanese working again in OpenACS 4, so I'll try to get that done in the next week (commit commit)