Forum OpenACS Q&A: Asian Characters and UTF-8 Encoding - Any experiences?

Hi,

An Asian user has just entered some double-byte characters into P/O (AOLserver 3.3oacs, OpenACS 5.1.0 and Oracle 8i with US7ASCII encoding), and the usual mess appeared. Has anyone had any experience with this? It might be interesting for .LRN as well.

I've already checked AOLserver and Tcl, and both should be fine with UTF-8 (according to the documentation)...

Thanks a lot in advance,
Frank

mailto:frank_dot_bergmann_at_project_dash_open_dot_com
http://www.project-open.com/

Posted by Brian Fenton on
Hi Frank,
I've got some experience with character set problems (not with Asian characters but I'm sure the principles apply). You say that Oracle is using US7ASCII encoding - that definitely doesn't sound good. You should change it to UTF-8. Run this query to check the character set:

select * from NLS_DATABASE_PARAMETERS
where parameter = 'NLS_CHARACTERSET';

You should see the following if it's UTF-8:

PARAMETER                      VALUE
------------------------------ ------------------------------
NLS_CHARACTERSET               UTF8

If, as you say, your character set is not UTF-8, you have a bit of work ahead of you. I doubt your Asian characters were entered correctly, so you'll need to clean out your data, convert your database to UTF8 and re-enter the data.
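
To illustrate why the characters are probably already lost: a 7-bit character set has no code for them, so any conversion has to substitute something. Here is a minimal Python sketch of the idea. It assumes the conversion substitutes a replacement character roughly the way Python's errors="replace" does; Oracle's exact behaviour may differ, and the sample text is only an example.

```python
# Sketch: what a lossy conversion to a 7-bit character set does to CJK text.
# Python's "ascii" codec stands in for US7ASCII here; this is an analogy,
# not a reproduction of Oracle's conversion code.
text = "男孩儿"

# Each character that ASCII cannot represent is replaced by "?".
stored = text.encode("ascii", errors="replace")
print(stored)

# The original characters cannot be recovered from what was stored.
restored = stored.decode("ascii")
print(restored == text)
```

Once the substitution has happened, no later change of character set can bring the original text back, which is why the data has to be re-entered.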

Also put the following settings in AOLserver's configuration file (in the ns_section ns/parameters section):

ns_param  HackContentType    1
ns_param  DefaultCharset    utf-8
ns_param  HttpOpenCharset    utf-8
ns_param  OutputCharset      utf-8
ns_param  URLCharset        utf-8

Add this line to the AOLserver wrapper script to set your environment variable (in a Unix shell script, use export NLS_LANG=.UTF8 instead of set):
set NLS_LANG=.UTF8

Here's an excellent resource from Oracle:
http://www.oracle.com/technology/products/oracle8i/htdocs/faq_combined.htm

Hope this helps,
Brian

Posted by Brian Fenton on
PS here's the Oracle note on how to convert your database character set:

http://metalink.oracle.com/metalink/plsql/ml2_documents.showDocument?p_id=66320.1&p_database_id=NOT

Quick summary: alter database character set UTF8;

Beware the dreaded ORA-12712: new character set must be a superset of old character set!

As I said, your data is probably screwed, so there's no point running the conversion until you clean it up.
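
The superset requirement and the need for clean data can be sketched in Python, using Python codecs as stand-ins for the Oracle character sets. This is an illustration only: is_valid_utf8 is a hypothetical helper and the sample bytes are made up.

```python
# Sketch: "superset" in the ORA-12712 sense, illustrated with Python codecs.
# Every 7-bit ASCII byte means the same thing in UTF-8, so clean ASCII data
# is already valid UTF-8 and needs no rewriting.
ascii_is_subset = all(
    bytes([b]).decode("ascii") == bytes([b]).decode("utf-8")
    for b in range(128)
)
print(ascii_is_subset)


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the bytes can be read as UTF-8 without errors."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Bytes above 127 that ended up in a nominally 7-bit database (here a
# hypothetical ISO-8859-1 e-acute) are not valid UTF-8.
print(is_valid_utf8(b"cafe"))      # plain ASCII
print(is_valid_utf8(b"caf\xe9"))   # stray 8-bit byte
```

A check along these lines, run over exported data, is one way to find rows that need cleaning before the character set is changed.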

Posted by Frank Bergmann on
Thanks a lot, Brian!

That sounds excellent. So, to summarize your statement: there is no other nasty issue with double-byte characters (such as the Oracle driver implementation or a broken AOLserver implementation) apart from the Oracle charset?

And yes, our default Oracle charset is still US7ASCII, following a recommendation of Philip from around 1999...

Bests,
Frank

Posted by Brian Fenton on
I've certainly had no problems with double-byte characters using a UTF-8 database, but I've only been using European letters. Maybe somebody else can confirm that there are no issues with Asian characters.

The latest version of the OpenACS install docs recommends a Character Set of UTF8:

http://openacs.org/doc/current/oracle.html

Best wishes,
Brian

Posted by Andrew Piskorski on
(Not that it's relevant, but I rather doubt that Philip G. ever recommended US7ASCII for the Oracle character set, even in 1999.)

The fact is that at least through Oracle 8.1.7.x, US7ASCII is the default. Oracle also stupidly hides this in their installer, so unless you specifically look for it when installing Oracle, you will almost certainly end up with a US7ASCII rather than UTF8 database - bad. My old outdated Oracle install notes have some more info on that UTF8 character set stuff.

Posted by Nis Jørgensen on
At Greenpeace we are serving Arabic and Hebrew pages using UTF-8, Russian using iso-8859-5 and Chinese using big5 and EUC-CN*. All this from one AOLserver setup (hacking the content type header).

Using AOLServer 3.3oacs, a heavily modified OpenACS 4.6.3 and Oracle 8.1.7 with UTF-8 encoding

/Nis

* Note: The EUC-CN encoding is commonly referred to as GB2312 - but the two are different encodings of the same character set, and EUC-CN is the one everyone is using.
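
The idea that one character set can have several encodings can be shown with a short Python sketch. HZ is picked here purely as a second, 7-bit encoding of the GB2312 characters for illustration; the thread does not say which other encoding is meant, and the sample text is arbitrary.

```python
import codecs

# Python registers the EUC-CN encoding under the codec name "gb2312".
print(codecs.lookup("euc-cn").name)

text = "中文"
euc = text.encode("euc-cn")   # 8-bit encoding: two high bytes per character
hz = text.encode("hz")        # 7-bit encoding of the same GB2312 characters

print(euc)
print(hz)

# The byte strings differ, but both decode to the same characters.
print(euc != hz)
print(euc.decode("euc-cn") == hz.decode("hz") == text)
```

The charset named in the Content-Type header has to identify the encoding actually used for the bytes, not just the character repertoire.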

Posted by Frank Bergmann on
Hi Nis,

thanks a lot for your comment. May I ask you why you are using different encodings for Russian and Chinese?

Bests,
Frank

Nis,

I have some issues with displaying German characters in an old 4.6.2 through ad33.13. Any text returned from PostgreSQL displays properly, but any UTF-8 text in adp pages or pulled from the file system simply displays as question marks in browsers.

I have tried adding a meta http-equiv tag to set the content-type charset parameter to UTF-8 and have also tried setting it to iso-8859-1, but this makes absolutely no difference.

I wonder if you would give me some guidance on how to hack the content type header from AOLserver. Also, does the ns_param OutputCharset parameter work for ad33.13, or is this a parameter for a later version of AOLserver?

Many Thanks

Regards
Richard

Richard, in your AOLserver config file, do you perhaps have some settings like these?:

ns_section ns/parameters
   ns_param OutputCharset iso-8859-1
   ns_param HackContentType 1

ns_section ns/MimeTypes
   set mime_plain {text/plain; charset=iso-8859-1}
   set mime_html  {text/html; charset=iso-8859-1}

   # See also "http://dqd.com/~mayoff/encoding-doc.html" for advice on
   # character sets and MIME types in AOLserver.

   ns_param Default     $mime_plain
   ns_param NoExtension $mime_plain
   ns_param .txt  $mime_plain
   ns_param .text $mime_plain
   ns_param .htm  $mime_html
   ns_param .html $mime_html

That's what I use in AOLserver 4.0.10, but note that I am purposely serving only iso-8859-1 content. If you are trying to serve UTF-8, some of the settings above would probably break stuff for you.
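
The reason for putting the charset into the MIME type is that the header and the bytes actually sent must agree. A minimal Python sketch of that contract follows. It is not AOLserver code; build_response is a hypothetical helper and the sample page is made up.

```python
from typing import Tuple


def build_response(page: str, charset: str) -> Tuple[str, bytes]:
    """Return a Content-Type header value and a body encoded to match it."""
    header = f"text/html; charset={charset}"
    # Characters the target charset lacks become "?" instead of raising.
    body = page.encode(charset, errors="replace")
    return header, body


header, body = build_response("<p>Grüße</p>", "iso-8859-1")
print(header)
print(body)

# A client that trusts the header recovers the original text.
print(body.decode("iso-8859-1"))
```

If the header said utf-8 while the body was encoded as iso-8859-1 (or the other way round), the browser would decode the bytes with the wrong table.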

Talk about serendipity ... I randomly went to look at openacs.org for the first time in months - and found this (I did not have notifications on for this thread - I believe there should be a way to set forums up to always do that).

Anyway, I don't think your problem is the same one that we solved. All our adp files were[1] in plain ASCII - the big trick was to make AOLserver do the "correct" conversion of the generated page (Unicode -> local encoding).

It sounds to me like AOLServer makes wrong assumptions about the charset of your files. Not sure how to handle that.

[1] We are now running a new OACS-based CMS, serving everything as utf-8.
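
What a wrong assumption about the charset of a file does to its contents can be sketched in Python. Which of these cases, if any, matches the setup described above is not known; the sample text is made up.

```python
# A UTF-8 file on disk, as raw bytes (sample text is made up).
file_bytes = "schön".encode("utf-8")

# Wrong assumption: reading the bytes as ISO-8859-1 yields mojibake,
# two junk characters in place of the single o-umlaut.
as_latin1 = file_bytes.decode("iso-8859-1")
print(as_latin1)

# Wrong assumption: reading them as 7-bit ASCII with substitution
# replaces each non-ASCII byte with a replacement character.
as_ascii = file_bytes.decode("ascii", errors="replace")
print(as_ascii)

# Correct assumption: the text survives.
print(file_bytes.decode("utf-8"))
```

A meta tag in the page cannot repair either of the first two cases, because the damage happens when the server reads the file, before the browser sees anything.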

Hi All,

Well, I was searching for Unicode to Arabic conversions and came across this post.

What I am trying to do is convert some Unicode data that I have into Arabic. There is a function that does the conversion using a sequence of CASE statements: it searches for the Unicode code points and replaces each one with the corresponding Arabic character.

e.g. IF strT = '067E' THEN Dest := Dest || Chr(129 USING NCHAR_CS); --129Arabic Peh

Well the function works correctly in Oracle 9i and above but refuses to compile in versions of Oracle below 9i.

Here are some more details:
Oracle Version: Oracle 8.1.7
Compilation error: PLS-00561: character set mismatch on value for parameter 'RIGHT'.

I am assuming it is the NCHAR_CS that creates the problem.

Would appreciate any help on this. Thanks in advance.

Regards
Chetz

I have recycled the parts of my brain I used to store Oracle knowledge. Even then, I don't think I ever worked with non-standard character sets in it - only ASCII and some Oracle version of Unicode.

My suggestion would be to do any transformations in Tcl rather than in Oracle. In fact, I would suggest getting rid of Oracle and switching to Postgres. We did, and I haven't had a single moment of regret.
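
As a rough idea of what doing the transformation outside the database could look like, here is a sketch in Python rather than Tcl. It assumes the target encoding is Windows-1256, which is only a guess based on 129 being the code for Peh in that code page; code_point_to_cp1256 is a hypothetical helper.

```python
def code_point_to_cp1256(hex_code: str) -> bytes:
    """Map a hex Unicode code point such as '067E' to its Windows-1256 byte."""
    char = chr(int(hex_code, 16))
    # Raises UnicodeEncodeError if Windows-1256 has no such character.
    return char.encode("cp1256")


# Arabic letter Peh, the example from the question.
print(code_point_to_cp1256("067E"))
```

With a codec table doing the lookup, the long sequence of per-character branches is no longer needed.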

/Nis

Posted by Evica Ilieva on
Hi to all
I have a problem. I connect to Oracle with PHP. It works, but I have two problems.
It fetches every column twice, and it doesn't show the data that was stored in the database with a Cyrillic encoding. The encoding of the database is UTF-8. I also tried setting the encoding in the HTML/PHP code to UTF-8, but it doesn't work.
I had the same problem with MySQL, but I found an SQL query which solves it. The code is mysql_query("set names cp1251"); and it is inserted right after the code that connects to the database.
I tried this query with Oracle, but it doesn't work.

Please help!

Evica

Posted by brad chick on
I am having trouble getting Chinese characters in/out of oracle with AOLserver. Here is my stack:

Oracle 11g
TCL 8.5
AOLserver 4.51
Oracle Driver version 2.7

The existing database has a database character set of WE8ISO8859P1. But we are using NCHAR and NVARCHAR2 datatypes to store Unicode characters. The NLS_NCHAR_CHARACTERSET is properly set to AL16UTF16.

I am setting the following in the environment via an nsd wrapper script:

export NLS_LANG=_.UTF8

I can log in to sqlplus and insert and select Chinese characters:

insert into test_zhs (foo) values (N'男孩儿男孩儿');

SQL> select * from test_zhs;

FOO
--------------------------------------------------------------------------------
男孩儿男孩儿

I am also sure that AOLserver/Tcl are treating the characters appropriately. For example, this form takes whatever characters are entered, tries to insert them, and spits them back out. When Chinese characters are entered, the server returns them unchanged.
http://jp.xacte.com:8181/test/db/myform.tcl

On the other hand, no matter what I try, I can't get AOLserver to get them into Oracle properly. I suspect it's the Oracle driver, but people have suggested that it is possible to put Unicode characters into Oracle using that driver.

Any help would be way helpful.

Thanks

Posted by Brian Fenton on
Hey Brad

I've never used the NCHAR and NVARCHAR stuff, so maybe that is the way to go, but I understood that you might run into trouble having your database character set as WE8ISO8859P1. It might be worth running a quick test with a new UTF8 database, just to rule that out.

I presume you tried the AOLserver config file settings described above?

Does it work if you hardcode some Chinese characters directly into an INSERT statement in a TCL proc?
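
For such a hardcoded test, one way to take the encoding of the Tcl source file out of the picture is to write the characters as \uXXXX escapes, which Tcl understands inside double-quoted strings. Below is a small Python sketch that generates them. tcl_escape is a hypothetical helper; it only handles characters in the Basic Multilingual Plane and leaves ASCII characters, including ones that are special to Tcl, untouched.

```python
def tcl_escape(text: str) -> str:
    """Write non-ASCII characters as \\uXXXX escapes (BMP characters only)."""
    out = []
    for ch in text:
        code = ord(ch)
        if code < 0x80:
            # ASCII is passed through unchanged; no Tcl quoting is applied.
            out.append(ch)
        elif code <= 0xFFFF:
            out.append(f"\\u{code:04x}")
        else:
            raise ValueError(f"outside the BMP: U+{code:X}")
    return "".join(out)


# The sample string from the sqlplus test above.
print(tcl_escape("男孩儿"))
```

If the escaped string reaches the database intact while a literal one does not, the source file encoding is the likely culprit; if both fail, the problem is further down, in the driver or the database character set.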

Brian