Forum OpenACS Q&A: Re: Asian Characters and UTF-8 Encoding - Any experiences?

Posted by Brian Fenton on
Hi Frank,
I've got some experience with character set problems (not with Asian characters but I'm sure the principles apply). You say that Oracle is using US7ASCII encoding - that definitely doesn't sound good. You should change it to UTF-8. Run this query to check the character set:

select * from NLS_DATABASE_PARAMETERS
where parameter = 'NLS_CHARACTERSET';

You should see the following if it's UTF-8:
PARAMETER                      VALUE
------------------------- ------------------------------
NLS_CHARACTERSET              UTF8

If, as you say, your character set is not UTF-8, you have a bit of work ahead of you. I doubt your Asian characters were entered correctly, so you'll need to clean out your data, convert your database to UTF8 and re-enter the data.
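
To see why the existing rows are probably beyond repair, here is a minimal sketch in Python (used purely for illustration; it is not part of the OpenACS stack). It assumes that a 7-bit store replaces each character it cannot represent with "?". What Oracle actually does depends on the client and server NLS settings, so treat this as a model rather than a description of Oracle's exact behaviour.

```python
# Model of what a 7-bit character set does to non-ASCII text.
# Assumption: unrepresentable characters are replaced by "?".

def store_as_7bit(text: str) -> bytes:
    """Encode text for a 7-bit column, substituting '?' where needed."""
    return text.encode("ascii", errors="replace")

def read_back(stored: bytes) -> str:
    """Decode what the 7-bit column now holds."""
    return stored.decode("ascii")

original = "\u6771\u4eac Tokyo"   # "Tokyo" in kanji, followed by Latin text
stored = store_as_7bit(original)
restored = read_back(stored)

print(restored)  # ?? Tokyo
```

Once the substitution has happened, the original characters cannot be recovered from the stored bytes, which is why re-entering the data is needed.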

Also put the following settings in AOLserver's configuration file (in the ns_section ns/parameters section):

ns_param  HackContentType    1
ns_param  DefaultCharset    utf-8
ns_param  HttpOpenCharset    utf-8
ns_param  OutputCharset      utf-8
ns_param  URLCharset        utf-8

Add this line to the AOLserver wrapper script to set the environment variable (this is Windows batch syntax; in a Bourne-style shell script, write export NLS_LANG=.UTF8 instead):
set NLS_LANG=.UTF8

Here's an excellent resource from Oracle:
http://www.oracle.com/technology/products/oracle8i/htdocs/faq_combined.htm

Hope this helps,
Brian

Posted by Brian Fenton on
PS here's the Oracle note on how to convert your database character set:

http://metalink.oracle.com/metalink/plsql/ml2_documents.showDocument?p_id=66320.1&p_database_id=NOT

Quick summary: alter database character set UTF8;

Beware the dreaded ORA-12712: new character set must be a superset of old character set!

As I said, your data is probably screwed, so there's no point running the conversion until you clean it up.
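
The superset requirement behind ORA-12712 can be illustrated outside Oracle. The Python sketch below (an illustration only, not an Oracle tool) checks that every 7-bit ASCII byte decodes to the same character under UTF-8. It also shows a hypothetical helper, find_non_ascii, that flags bytes above 0x7F, the kind of data that would need cleaning up before a conversion.

```python
# ASCII is a strict subset of UTF-8: bytes 0x00-0x7F mean the same in both.
ascii_is_subset = all(
    bytes([b]).decode("ascii") == bytes([b]).decode("utf-8")
    for b in range(128)
)

def find_non_ascii(data: bytes) -> list:
    """Return the offsets of bytes that do not fit in 7 bits (hypothetical helper)."""
    return [i for i, b in enumerate(data) if b > 0x7F]

clean = b"plain ASCII text"
dirty = "caf\u00e9".encode("latin-1")  # 8-bit data that slipped into a 7-bit database

print(ascii_is_subset)
print(find_non_ascii(clean))
print(find_non_ascii(dirty))
```

Data for which find_non_ascii returns an empty list reads identically before and after the character set change; anything else needs attention first.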

Posted by Frank Bergmann on
Thanks a lot, Brian!

That sounds excellent. So, to summarize your statement: apart from the Oracle charset, there are no other nasty issues with double-byte characters (such as the Oracle driver implementation or a broken AOLserver implementation)?

And yes, our default Oracle charset is still US7ASCII, following a recommendation of Philip's from around 1999...

Bests,
Frank

Posted by Brian Fenton on
I've certainly had no problems with double-byte characters using a UTF-8 database, but I've only been using European letters. Maybe somebody else can confirm that there are no issues with Asian characters.
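
Strictly speaking, UTF-8 is variable-width rather than double-byte, so a test with European letters exercises only the 2-byte case. A quick Python illustration (not part of the original setup):

```python
# UTF-8 byte length per character: accented European letters take 2 bytes,
# while most CJK characters take 3.
samples = {
    "a": "ASCII",
    "\u00e9": "Latin accented",   # e with acute accent
    "\u0436": "Cyrillic",         # zhe
    "\u4e2d": "CJK",              # "middle"
}

lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, label in samples.items():
    print(f"{label:15} U+{ord(ch):04X} -> {lengths[ch]} byte(s)")
```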

The latest version of the OpenACS install docs recommends a Character Set of UTF8:

http://openacs.org/doc/current/oracle.html

Best wishes,
Brian

Posted by Andrew Piskorski on
(Not that it's relevant, but I rather doubt that Philip G. ever recommended US7ASCII for the Oracle character set, even in 1999.)

The fact is that, at least through Oracle 8.1.7.x, US7ASCII is the default. Oracle also stupidly hides this in their installer, so unless you specifically look for it when installing Oracle, you will almost certainly end up with a US7ASCII rather than a UTF8 database, which is bad. My old, outdated Oracle install notes have some more info on the UTF8 character set.

Posted by Nis Jørgensen on
At Greenpeace we are serving Arabic and Hebrew pages using UTF-8, Russian using iso-8859-5 and Chinese using big5 and EUC-CN*. All this comes from one AOLserver setup (hacking the content type header).

Using AOLServer 3.3oacs, a heavily modified OpenACS 4.6.3 and Oracle 8.1.7 with UTF-8 encoding

/Nis

* Note: The EUC-CN encoding is commonly referred to as GB2312 - but the two are different encodings of the same character set, and EUC-CN is the one everyone is using.
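
The distinction in this footnote can be seen in a short Python sketch (Python is used only for illustration). Python's codec registry, like many tools, files the EUC-CN encoding under the name gb2312. Subtracting 0xA0 from each byte recovers the row and cell numbers of the underlying GB 2312 character set.

```python
import codecs

# Python registers "euc-cn" as an alias of its "gb2312" codec.
codec_name = codecs.lookup("euc-cn").name

text = "\u4e2d\u6587"  # the word for "Chinese (language)"
encoded = text.encode("euc-cn")

# In EUC-CN each GB 2312 character is two bytes: (row + 0xA0, cell + 0xA0).
pairs = [
    (encoded[i] - 0xA0, encoded[i + 1] - 0xA0)
    for i in range(0, len(encoded), 2)
]

print(codec_name)
print(encoded.hex())
print(pairs)
```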

Posted by Frank Bergmann on
Hi Nis,

thanks a lot for your comment. May I ask you why you are using different encodings for Russian and Chinese?

Bests,
Frank

Nis,

I have some issues displaying German characters in an old 4.6.2 installation served through ad33.13. Any text returned from PostgreSQL displays properly, but any UTF-8 text in adp pages or pulled from the file system simply displays as question marks in browsers.

I have tried adding a meta http-equiv tag to set the content-type charset parameter to UTF-8, and have also tried setting it to iso-8859-1, but this makes absolutely no difference.

I wonder if you would give me some guidance on how to hack the content type header in AOLserver. Also, does the ns_param OutputCharset parameter work for ad33.13, or is it a parameter for a later version of AOLserver?

Many Thanks

Regards
Richard

Richard, in your AOLserver config file, do you perhaps have some settings like these?:

ns_section ns/parameters
   ns_param OutputCharset iso-8859-1
   ns_param HackContentType 1

ns_section ns/MimeTypes
   set mime_plain {text/plain; charset=iso-8859-1}
   set mime_html  {text/html; charset=iso-8859-1}

   # See also "http://dqd.com/~mayoff/encoding-doc.html" for advice on
   # character sets and MIME types in AOLserver.

   ns_param Default     $mime_plain
   ns_param NoExtension $mime_plain
   ns_param .txt  $mime_plain
   ns_param .text $mime_plain
   ns_param .htm  $mime_html
   ns_param .html $mime_html

That's what I use in AOLserver 4.0.10, but note that I am purposely serving only iso-8859-1 content. If you are trying to serve UTF-8, some of the settings above would probably break stuff for you.
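
One way to check whether settings like these match the content actually being served is to compare the charset declared in the Content-Type value with the bytes of the body. Below is a minimal sketch in Python (used for illustration only; charset_of and body_matches are hypothetical helper names, not AOLserver functions).

```python
from email.message import Message

def charset_of(content_type: str):
    """Extract the charset parameter from a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()

def body_matches(content_type: str, body: bytes) -> bool:
    """True if the body decodes cleanly under the declared charset."""
    charset = charset_of(content_type)
    if charset is None:
        return False
    try:
        body.decode(charset)
        return True
    except (UnicodeDecodeError, LookupError):
        return False

latin1_body = "Gr\u00fc\u00dfe".encode("iso-8859-1")
utf8_body = "Gr\u00fc\u00dfe".encode("utf-8")

print(charset_of("text/html; charset=iso-8859-1"))
print(body_matches("text/html; charset=utf-8", latin1_body))
print(body_matches("text/html; charset=utf-8", utf8_body))
```

The check only works in one direction: iso-8859-1 assigns a character to every byte value, so a UTF-8 body declared as iso-8859-1 decodes without error and simply displays wrongly.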

Talk about serendipity ... I randomly went to look at openacs.org for the first time in months and found this (I did not have notifications on for this thread; I believe there should be a way to set forums up to always do that).

Anyway, I don't think your problem is the same one we solved. All our adp files were[1] in plain ASCII; the big trick was to make AOLserver do the "correct" conversion of the generated page (Unicode -> local encoding).

It sounds to me like AOLServer makes wrong assumptions about the charset of your files. Not sure how to handle that.
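
Here is a hedged illustration of how a wrong charset assumption can end in question marks, written in Python rather than taken from AOLserver's own code. It assumes one plausible failure path: UTF-8 file bytes are read as iso-8859-1 and then sent in a charset that cannot represent the result. Whether ad33.13 actually behaves this way is not confirmed here.

```python
# The file on disk is UTF-8, but the reader assumes iso-8859-1.
file_bytes = "M\u00fcller".encode("utf-8")

misread = file_bytes.decode("iso-8859-1")   # each UTF-8 byte becomes its own character
correct = file_bytes.decode("utf-8")        # what was intended

# If the output charset cannot represent a character, it becomes "?".
sent_misread = misread.encode("ascii", errors="replace")
sent_correct = correct.encode("ascii", errors="replace")

print(len(misread), len(correct))
print(sent_misread)
print(sent_correct)
```

In this model a misread file yields two question marks per accented letter, while a correctly read file sent in too small a charset yields one, which can help narrow down where the conversion goes wrong.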

[1] We are now running a new OACS-based CMS, serving everything as utf-8.

Hi All,

Well I was searching for Unicode to Arabic conversions and came across this post.

What I am trying to do is convert some Unicode data that I have into Arabic. There is a function that does the conversion using a sequence of CASE statements: it searches for the Unicode code points and replaces each one with the corresponding Arabic character.

e.g. IF strT = '067E' THEN Dest := Dest || Chr(129 USING NCHAR_CS); --129Arabic Peh

Well the function works correctly in Oracle 9i and above but refuses to compile in versions of Oracle below 9i.

Here are some more details:
Oracle Version: Oracle 8.1.7
Compilation error: PLS-00561: character set mismatch on value for parameter 'Right'.

I am assuming it is the NCHAR_CS that creates the problem.

I would appreciate any help on this. Thanks in advance.

Regards
Chetz

I have recycled the parts of my brain I used to store Oracle knowledge. Even then, I don't think I ever worked with non-standard character sets in it, only ASCII and some Oracle version of Unicode.

My suggestion would be to do any transformations in Tcl rather than in Oracle. In fact, I would suggest getting rid of Oracle and switching to Postgres. We did, and I haven't had a single moment of regret.
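
As a sketch of doing the transformation outside the database, the example below uses Python rather than Tcl, purely for illustration. It assumes the input is a string of 4-digit hexadecimal code points (as in the '067E' example) and that the target is the Windows-1256 code page, in which Arabic Peh is byte 129. Both assumptions are guesses based on the posted snippet, and hex_codepoints_to_text is a hypothetical helper name.

```python
def hex_codepoints_to_text(hex_string: str) -> str:
    """Turn '067E0628' into the characters U+067E, U+0628 (4 hex digits each)."""
    if len(hex_string) % 4 != 0:
        raise ValueError("expected a multiple of 4 hex digits")
    return "".join(
        chr(int(hex_string[i:i + 4], 16))
        for i in range(0, len(hex_string), 4)
    )

text = hex_codepoints_to_text("067E")   # Arabic letter Peh
legacy_bytes = text.encode("cp1256")    # assumed target code page

print(f"U+{ord(text):04X}")
print(list(legacy_bytes))
```

A codec lookup like this replaces the long chain of per-character branches with a single table-driven step, and it raises an error for any character the target code page cannot represent.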

/Nis