Forum OpenACS Q&A: howto run 3.2.5 with European characters on nsd8x

I spent the last couple of hours reading through most of the threads
concerning ISO 8859-1 encoding + nsd8x on OpenACS 3.2.5 and found
that quite a few people have problems with it, while the solutions
are not really documented!

I tried to get nsd8x running a couple of months ago, but didn't
manage it back then... Does anyone have a working 3.2.x site on nsd8x?
Which steps would I have to take in order to change my current nsd76
system (pg encoding=SQL_ASCII) to nsd8x???

2: Please, give us answers! (response to 1)
Posted by Rocael Hernández Rizzardini on
Yes, I have had these problems since I started working with OpenACS. I know there are some ways to handle this, but none are documented! I guess this is a PRIORITY step toward making OpenACS universally accepted. Remember, not everyone speaks English.

I've been running a website with OpenACS 3.2.4 on nsd8x with iso 8859-2 for some time now. It is not the same encoding you are going to use, but why not give it a try, replacing 8859-2 with 8859-1 along the way?

Database setup: the OACS database in PostgreSQL was created with

 createdb -E LATIN2 yourdb 
so it stores stuff with single-byte encoding internally. Since tcl8x handles strings in unicode, I added this to /etc/profile:
PGCLIENTENCODING='UNICODE' ; export PGCLIENTENCODING
This way, the string conversion between PostgreSQL and AOLserver is handled smoothly by PostgreSQL.

Now, to get iso 8859-2 output from Aolserver, you need to add a few lines to your nsd.tcl:

 ns_section "ns/parameters"
        ns_param   OutputCharset iso-8859-2
        ns_param   UrlCharset iso-8859-2
        ns_param   HttpOpenCharset iso-8859-2

 ns_section "ns/mimetypes"
    ns_param ".html" "text/html; charset=iso-8859-2"
    ns_param ".txt" "text/plain; charset=iso-8859-2"
    ns_param ".adp" "text/plain; charset=iso-8859-2"

and
 encoding system iso8859-2
at the end of the file. That basically does the trick for output. To make input work, you should edit form.tcl (from aolserver/modules/tcl), find this line:
 ns_set put $form $name $value 
and replace it with
 ns_set put $form $name [encoding convertfrom iso8859-2 $value]
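
To see what that patch does at the byte level, here is a minimal shell sketch, assuming iconv and od are installed (they are not part of the AOLserver setup, and the function name is made up for illustration). It performs the same kind of ISO 8859-2 to Unicode conversion that "encoding convertfrom iso8859-2" applies to each form value, shown here as UTF-8 bytes for the Polish letter "ą".

```shell
# Convert stdin from ISO 8859-2 to UTF-8 and print the resulting bytes as hex.
# Assumes iconv and od are installed; the function name is hypothetical.
latin2_to_utf8_hex() {
    iconv -f ISO-8859-2 -t UTF-8 | od -An -tx1 | tr -d ' \n'
}

# The single byte 0xB1 (octal 261) is "ą" in ISO 8859-2.
# In UTF-8 the same letter takes two bytes: c4 85.
printf '\261' | latin2_to_utf8_hex
echo
```

Without the conversion, the raw byte 0xB1 would be taken as a different character, which is presumably why unpatched form input comes out wrong.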

Well, that is it, more or less. Two issues remain: getting the right charset in outgoing emails generated by OpenACS, and correct ADP parsing. I am not sure how I solved these. IIRC the trick with ADP is to use the other parser (of the two available) if you do not get good results from the start. For email, you need to modify sendmail.tcl and add three header lines describing the charset; in my case:

MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-2
Content-Transfer-Encoding: 8bit

As for my setup, I am running Aolserver 3.2ad12 and PostgreSQL 7.1.3 (configured with --enable-unicode-conversion --enable-locale --enable-multibyte=UNICODE)

David, let me know if that works for you. If it does, I will look further into my code - I remember there was some additional tweaking needed to make OACS ADP templates output correct charset as well.

I have some bad notes and patches to make ACS 3.2.5 work with
Japanese. Will work for ISO-XXX as well, but the patches
are hardcoded for shiftjis. You can find them at

http://www.ai.mit.edu/people/hqm/openacs

I have contributed some patches for OpenACS4 so that configuring
the system for any specific default encoding will just work
from the .tcl config file. Dunno if DonB got them merged in, but
they are sitting in a patch file in the acs-lang module directory
also.

I made things work by compiling postgres with unicode support, and
building the database with unicode enabled, i.e.,

./configure --enable-locale --enable-recode --enable-multibyte
            --enable-unicode-conversion
            --with-maxbackends=64 --with-tcl
            --with-perl --with-openssl --with-CXX --enable-syslog

(you don't need all that crap, but that's how my machine
happened to be config'd)

createdb --encoding=unicode openacs-4

Czesc,

1.) I have it almost working, but while Marcin is proposing to create the db with LATIN2 (in my case LATIN1), Henry is proposing UNICODE...

createdb --encoding=unicode yourdb

createdb -E LATIN2 yourdb

Which should I take? (pros - cons...)

2.) I read some threads about problems with improper sorting and with upper() or lower(). Is this correct? If yes, what can I do about it? I read something about LANG=C ??!!??

3.) Will there be a problem importing my existing database (sql-ascii), which is not compiled with unicode-conversion, into the newly compiled postgres with unicode-conversion + unicode or Latin1??

Thanks

By the way... How can I check which way I compiled my existing version of PG 7.1.3? I believe that I didn't compile it with --enable-unicode-conversion...

Okay, I am almost there:

1.) In order to make ns_write return the right characters I had to add the following line to the end of the proc "ReturnHeaders":

ns_startcontent -charset "iso-8859-1"

(ns_return works without this patch)

2.) In order to use ns_startcontent you will have to install AOLserver 3.3.1ad13

3.) I compiled PG7.1.3 like this:

./configure --enable-locale --enable-recode --enable-multibyte --enable-unicode-conversion --with-tcl --with-perl --with-CXX --enable-syslog

4.) When you are using AOLserver 3.3.1ad13 you don't have to put the line "encoding system iso8859-1" at the end of the nsd.tcl file, because there is an entry in /home/aolserver/modules/tcl/init.tcl already:

encoding system [ns_config ns/server/[ns_info server] SystemEncoding iso8859-1]

It might have been set to utf-8 originally... I don't really remember... so just check init.tcl

5.) I created my database with:

createdb --encoding=unicode yourdb

---------------------------------------------------------------------

What works now:

1.) Everything is being displayed correctly in a browser (.adp / .tcl / .txt / .html) !!!

What doesn't seem to work properly:

1.) When I create a database with:

createdb --encoding=LATIN1 yourdb

it looks like everything is being imported correctly, but when I do a "select * from users;" 0 rows are returned???!!!

The db is imported correctly when I created the db with:

createdb --encoding=unicode yourdb

2.) When I update users.first_names with a German character like "ä", the result is displayed correctly in the browser. But when I check the database from a shell via "psql mydb", the newly added German character is displayed weirdly... The shell displays the German characters that got imported from my old database correctly though!!!

3.) The same thing happens when I update a file from the file-storage via file-manager... When I edit the file from a browser all the special characters are being displayed correctly. But when I open the file from the shell using "vim" the special characters are being displayed weirdly!!!

4.) Community members' pictures are not displayed by my browser anymore... (They are stored in the database and not in the file-storage)
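
Symptoms 2 and 3 look like the same bytes being read under two different encodings. Here is a minimal shell sketch, assuming iconv is installed and assuming the shell terminal and vim expect Latin-1 (the function name is hypothetical). A freshly stored "ä" is two bytes in UTF-8, and a Latin-1 reader shows each byte as its own character.

```shell
# Re-interpret stdin byte by byte as ISO 8859-1 and emit it as UTF-8,
# which imitates what a Latin-1 terminal does with UTF-8 data.
# Assumes iconv is installed; the function name is hypothetical.
utf8_read_as_latin1() {
    iconv -f ISO-8859-1 -t UTF-8
}

# "ä" is the single byte 0xE4 in ISO 8859-1,
# but the two bytes 0xC3 0xA4 (octal 303 244) in UTF-8.
printf '\303\244' | utf8_read_as_latin1   # shows two characters instead of one
echo
```

If this is what is happening, the data in the database is fine and only the display side (terminal or editor encoding) needs to match.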

Okay I have everything but one thing working now. The only thing that doesn't really work is displaying pictures that are saved in the database as a blob...

What I did:

1.) Installed aolserver 3.3.1ad13

2.) Put the following lines in nsd.tcl
ns_section "ns/parameters"
        ns_param   HackContentType      true
        ns_param   URLCharset           iso-8859-1
        ns_param   OutputCharset        iso-8859-1
        ns_param   HttpOpenCharset      iso-8859-1

ns_section "ns/mimetypes"
        ns_param        ".html"   "text/html; charset=iso-8859-1"
        ns_param        ".tcl"    "text/html; charset=iso-8859-1"
        ns_param        ".htm"    "text/html; charset=iso-8859-1"
        ns_param        ".adp"    "text/plain; charset=iso-8859-1"
        ns_param        ".txt"    "text/plain; charset=iso-8859-1"

3.) Put the following line in /home/aolserver/modules/tcl/init.tcl
encoding system [ns_config ns/server/[ns_info server] SystemEncoding iso8859-1]

4.) Replaced line 291 /home/aolserver/modules/tcl/form.tcl with the following line:
ns_set put $form $name [encoding convertfrom iso8859-1 $value]

5.) Added the following line to the proc ReturnHeaders in tcl/ad-utilities.tcl.preload:
ns_startcontent -charset "iso-8859-1"

6.) Compiled Postgres 7.1.3 with:
./configure --enable-locale --enable-recode --enable-multibyte --enable-unicode-conversion --with-maxbackends=64 --with-tcl --with-perl --with-openssl --with-CXX --enable-syslog

7.) Initialized Postgres with:
initdb -E unicode

8.) Created my database with:
createdb mydb
(you don't have to use "-E unicode" anymore, because you initialized PG with "-E unicode")

9.) Added this to /etc/profile:
PGCLIENTENCODING='UNICODE'
export PGCLIENTENCODING

10.) Here comes the sun now:
After I imported my old database (sql-ascii) I would see all the special characters displayed correctly in my browser. When I checked e.g. the users table in my database with "psql mydb" everything would display correctly, too!!!

After I changed some data from the users table with my browser and checked my database with "psql mydb" again, the newly updated line would show corrupted characters, while the untouched lines would display correctly!!!

Keep calm. Technically, everything went correctly. What I did then:

10.1.) Typed the following in my shell:
PGCLIENTENCODING='Latin1' ; export PGCLIENTENCODING

10.2.) Reentered my database:
psql mydb

The new line was displayed correctly this time, while PG had the following problem with lines that contained *imported* special characters:

ERROR: Could not convert UTF-8 to ISO8859-1
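
One plausible reading of this error, assuming the old SQL_ASCII dump was loaded without any conversion: the imported rows still hold raw Latin-1 bytes, which are not valid UTF-8, so the server cannot convert them for a Latin1 client. Here is a minimal sketch of telling the two kinds of data apart, assuming iconv is installed (the sample files and the function name are hypothetical).

```shell
# Return success if the file is valid UTF-8, failure otherwise.
# Assumes iconv is installed; it exits non-zero on an invalid byte sequence.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# Hypothetical sample files: the same name with "ä" in two encodings.
tmpdir=$(mktemp -d)
printf 'M\344ller\n'     > "$tmpdir/latin1.txt"   # 0xE4: Latin-1 byte
printf 'M\303\244ller\n' > "$tmpdir/utf8.txt"     # 0xC3 0xA4: UTF-8 bytes

for f in "$tmpdir/latin1.txt" "$tmpdir/utf8.txt"; do
    if is_valid_utf8 "$f"; then
        echo "$f: valid UTF-8"
    else
        echo "$f: not valid UTF-8"
    fi
done
rm -rf "$tmpdir"
```

Running the same check on the old dump file would show whether it needs converting (or an explicit client encoding) before it goes into a unicode database.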

--------------------------------------------------------------------
Questions:
1.) Do I have to import my old database in a special way that this last ERROR doesn't show?

2.) What can I do about the blob pictures not being displayed?

Thanks to Marcin and Henry so far!!!

If binary images are being corrupted, it sounds as if
AOLserver is trying to do some charset conversion on
the binary data, like converting UTF8 to ISO-8859-1 or something.
How are your binary images generated? Can you post the
tcl code for the page that generates them?

If you're trying to read it into a tcl string from a blob and then display it, you will need the ns_returnbinary command from my binary support module at http://www.vorteon.com/download/. I think you could probably set up one of the files in your tcl directory to do a
rename ns_return ns_return_classic
rename ns_returnbinary ns_return
rename ns_write ns_write_classic
rename ns_writebinary ns_write

and use these commands all the time but I haven't tried that.
I don't know how these commands might interact with your other efforts.

Hello David,

I installed your module (had to get the rpms for libgcc_s.so.1 first though), but nothing changed!!!

Hello Henry,

I am using /shared/portrait-bits.tcl to display the pictures of users:
set db [ns_db gethandle]

set file_type [database_to_tcl_string_or_null $db "select portrait_file_type
from users
where user_id = $user_id"]

if [empty_string_p $file_type] {
    ad_return_error "Couldn't find portrait" "Couldn't find a portrait for User $user_id"
    return
}

ReturnHeaders $file_type


ns_pg blob_write $db $portrait_id

I had the same problem as David. In my case it was caused by the fact that ReturnHeaders had been modified to handle Latin characters by appending this to it:
ns_startcontent -charset "iso-8859-1"
This arrangement doesn't work with images, so to get portraits displayed I had to call an unmodified version of ReturnHeaders.

This is clearly a temporary solution, because you have to find every place where you need to call the standard ReturnHeaders.

The weird thing is that my watchdog is showing the following error:
[05/Mar/2002:12:46:39]
    Error: can't read "portrait_id": no such variable
    can't read "portrait_id": no such variable
        while executing
    "ns_pg blob_write $db $portrait_id
    "
        (file "/web/server1/www/shared/portrait-bits.tcl" line 30)
        invoked from within
    "source $script"
        (procedure "ns_sourceproc" line 6)
        invoked from within
    "ns_sourceproc cns51 {}"
    Notice: Querying '
            select user_id, token, secure_token,
                   last_ip, last_hit from sec_sessions
            where session_id = 33244;'

Good to know that you found a way to make it work. As for the database issue (whether its encoding should be unicode or latin1), it looks like it can be either one, as long as strings are correctly converted to and from unicode for AOLserver when the database encoding is latin1/latin2. The PGCLIENTENCODING='UNICODE' in my /etc/profile takes care of that.

As for myself, I'd rather not have the database in unicode, despite the fact that it works for you. For one, I sometimes connect to the database from a Windows box via ODBC, and this client uses yet another encoding (windows1250), for which the conversion can be done by postgres, but only if latin2 encoding is used on the server's end.

The other reason is that I want to avoid problems similar to those you experienced: getting a mix of encodings within the database when some strings are entered via forms, some from the psql shell, some from dump files, and I forget to set the encoding info correctly for each 'way'. By setting the database encoding to latin2 I can be sure that at least the psql input is right, and dump files can be checked visually before importing them into the database. By the way, while in psql the encoding command comes in very handy when the database encoding differs from the terminal encoding.

My third reason is that I am not yet aware of proper locale files for pl_PL (Polish) for Unicode. Without them I can not depend on Postgres with correct sorting, upper(), lower(), etc.

For a good reference on encodings in PostgreSQL have a look at: http://postgresql.org/users-lounge/docs/7.1/admin/multibyte.html

You may want to do one thing: before importing a latin1-encoded dump file into unicode-encoded database, insert a line containing

encoding latin1
at the beginning of the file.
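
A minimal shell sketch of that step, assuming the line meant is psql's \encoding meta-command (with the leading backslash). The file names and the function name are hypothetical.

```shell
# Write a copy of a dump file whose first line is the psql meta-command
# "\encoding <name>", so psql converts the rest of the file on import.
# $1 = original dump, $2 = new dump, $3 = client encoding name
prepend_encoding() {
    { printf '%s\n' "\\encoding $3"; cat "$1"; } > "$2"
}

# Demo with a throwaway file standing in for a real latin1 dump.
tmpdir=$(mktemp -d)
printf 'SELECT 1;\n' > "$tmpdir/old.dump"
prepend_encoding "$tmpdir/old.dump" "$tmpdir/old-latin1.dump" latin1
cat "$tmpdir/old-latin1.dump"
rm -rf "$tmpdir"
```

Writing to a new file keeps the original dump untouched, so it can still be checked visually or re-imported another way.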

> How can I check which way I compiled my existing version of PG 7.1.3?
Look at the first lines of the config.status file in the directory where you put the PostgreSQL sources.
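
A minimal shell sketch of that check, assuming config.status records the original ./configure command line as plain text. The function name is hypothetical, and the sample file below only stands in for a real config.status.

```shell
# Report whether a configure flag appears anywhere in a config.status file.
# $1 = path to config.status, $2 = flag to look for
configured_with() {
    grep -q -F -e "$2" -- "$1"
}

# Demo on a stand-in file; point the function at the real config.status instead.
sample=$(mktemp)
printf '%s\n' '#! /bin/sh' '# ./configure --enable-locale --enable-multibyte' > "$sample"
for flag in --enable-multibyte --enable-unicode-conversion; do
    if configured_with "$sample" "$flag"; then
        printf '%s: yes\n' "$flag"
    else
        printf '%s: no\n' "$flag"
    fi
done
rm -f "$sample"
```

The match is a plain substring search, so looking for --enable-multibyte also matches --enable-multibyte=UNICODE.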

Not again the backslash syndrome :)
It should read \encoding, not encoding, in my post above.

Oops.  I wasn't aware of the existence of "ns_pg blob_write".  You won't
need my module after all.

It looks like the following line is missing from your portrait-bits.tcl file for
some reason.

set portrait_id [database_to_tcl_string $db "select lob from users where
user_id = $user_id"]

Marcin,

I just created a db with "createdb -E latin1" and imported my old (sql-ascii) database...

As I had my /etc/profile set to PGCLIENTENCODING='unicode' I first had to do a:

PGCLIENTENCODING='latin1' ; export PGCLIENTENCODING

before I could import the database. Otherwise an error message occurred...

When I enter the database via shell psql and don't want to see corrupted special characters I also have to set:

PGCLIENTENCODING='latin1' ; export PGCLIENTENCODING

Does this sound correct and familiar???

Concerning the image problem, which is supposedly the last thing on the nsd8x todo list...

ns_startcontent -charset "iso-8859-1" is the culprit while displaying images from the db...

Any solution for this???

Incidentally, I just installed the aolserver 3.3+ad13 openacs beta download from openacs.org/software, and character set translations worked out of the box from oracle unicode -> ISO8859-1 on the browser side. Thank you!!!

hi/czesc

I'm preparing to convert my system to iso-8859-2 in order to accommodate Polish characters.
My question is: since I'm using OACS 4.5, should I follow the same recommendations as posted here, or does version 4.5 take a different approach to changing the character set?

janus

Janus, this thread was very useful when I was setting up PostgreSQL and AOLserver to be iso-8859-1 compatible for OpenACS 4.5 installations. In fact, I only configured PostgreSQL 7.2 with
./configure --enable-locale --enable-multibyte
created databases with
createdb --encoding=LATIN1 somedbname
and made sure to have
PGCLIENTENCODING=UNICODE
in the environment of the unix user that starts AOLserver (but not the user that you enter psql as).
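
One way to keep the setting out of the psql user's environment is to set it only for the process being started. A minimal shell sketch; the function name is hypothetical and the demo command just stands in for whatever command starts AOLserver.

```shell
# Run a command with PGCLIENTENCODING set for that one process only,
# leaving the calling shell (and any later psql session) untouched.
# $1 = client encoding name, remaining arguments = command to run
with_client_encoding() {
    encoding=$1
    shift
    env PGCLIENTENCODING="$encoding" "$@"
}

# Demo: the child process sees the variable, the calling shell is not changed.
with_client_encoding UNICODE sh -c 'echo "child sees: $PGCLIENTENCODING"'
```

In a start script, the nsd start command would take the place of the sh -c demo.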