Forum OpenACS Development: PostgreSQL upper() / lower() functions and LATIN2 charset

I have problem with these functions.
I compiled PostgreSQL 7.1 RC1 with --enable-multibyte and --enable-unicode-conversion.

Then initdb -E UNICODE -D /usr/local/pgsql/data
Next I created database pl_test (createdb pl_test -E LATIN2).

When I try SELECT UPPER('some_text_with_polish_national_chars'); polish chars ars still in lower case but others (abcdef...wxyz) are in upper case.

If I compile PostgreSQL only with --enable-locale (locale = pl_PL), everything works OK. But if I want to use OpenACS, I MUST store everything in unicode.

Does anybody know any solution of this problem?

It sure sounds like a bug to me...I suggest you report it to the PG
group.  Though it is too late to get a bug fix into PG 7.1.0, PG 7.1.1
is already being discussed and I suspect you'd see a bugfix available
via CVS in relatively short order.
http://postgresql.lolix.org/devel-corner/docs/postgres/multibyte.htm

I'm just learning about how encodings are handled in
Postgres, but this document contains some useful information. In particular,
it looks like you may need to make sure your interactive pgsql client encoding
is set to your charset if you want to input 8 bit characters
through it.

SET CLIENT_ENCODING TO 'encoding';

To query the current frontend encoding:
SHOW CLIENT_ENCODING;

Funny. That is exactly what I have written to the pgsql-general
mailing list. No response, though :( And no improvement :(

While using the LATIN2-driven database, I have had no problems - the
problems start when using UTF-8. Even aften issuing the encoding
LATIN2 in psql I can't make UPPER() work. What's worse, it spits out
notices, saying that the backend can't convert some characters.

Basically, it is not the 'set client encoding ' thing's fault.
Something is obviously wrong (and I have tried every possible
combination of compilation/initdb/runtime parameters)

Try asking on the bugs list rather than general.  The core folk have
been very busy recently trying to get RC1 out the door, so you might
not get an immediate response.  If all else fails, I recommend you
e-mail Tom Lane directly.

Thanx for your responses but I think, I should explain my problem again.

I want to use (incoming) OpenACS 4.x with LATIN2 encoding.

In my opinion, I should set database encoding to UNICODE. So, I have to compile PostgreSQL 7.1 with --enable-multibyte and --enable-unicode-conversion (and, maybe, with --enable-locale).
Next initdb -E UNICODE -D /usr/local/pgsql/data .
Am I right?

As I wrote in my first post, upper() and lower() works improperly with these settings. But not only upper() and lower() works improperly. Also ORDER BY works improperly.

I test everything in this way:

  1. First, I create table
    CREATE TABLE names ( name varchar(20) );
  2. Next, I insert a few names via browser (encoded in ISO8859-2) to tcl script;
  3. The Tcl script encodes the names to UTF-8 (UNICODE) and sends them to PostgreSQL.
  4. Then, PostgreSQL sends results of query
    SELECT UPPER(name) AS fn FROM names ORDER BY fn;
    (encoded in UNICODE) to the Tcl script.
  5. The script encodes the results to ISO8859-2 and sends then to browser.

With this query it works improperly, but when I try with query SELECT name FROM names;, results are OK.

It looks PostgreSQL doesn't know, how to sort query results, so it use default sorting method (based on order of inserting records?).,

I'm not sure, it's PostgreSQL bug. Maybe it's problem with my locale settings (LC_ALL=pl_PL, LC_COLLATE=pl_PL , LC_CTYPE=pl_PL, LANG=pl_PL ) or with C library (glibc 2.1).


PS.
I have Debian GNU/Linux 2.2 "Potato" - Intel Celeron - kernel 2.2.19
PostgreSQL compiled with gcc 2.95.2

And therefore this is a postgres bug, not an openacs bug, and we can't
can't help you.  We don't fix postgres bugs.

You need to go talk to the postgres developers about this bug.  With
luck it will get fixed by PG 7.1.1 with a patch available earlier.

You stated the followng:
I test everything in this way:


First, I create table
CREATE TABLE names ( name varchar(20) );

Next, I insert a few names via browser (encoded in ISO8859-2) to tcl script; 
The Tcl script encodes the names to UTF-8 (UNICODE) and sends them to PostgreSQL. 
Then, PostgreSQL sends results of query
SELECT UPPER(name) AS fn FROM names ORDER BY fn;
(encoded in UNICODE) to the Tcl script. 
The script encodes the results to ISO8859-2 and sends then to browser. 
You have to be careful about what it means to submit LATIN-2 data from the browser to your AOLserver web server.

AOLserver *needs* to be told what character set your form data is in so that it can convert it to Unicode properly.

Your browser will *not* supply this information automatically. You set AOLserver to expect LATIN2 in the form data. There are about five different ways to do this. See my notes at http://imode.arsdigita.com/i18n or http://www.ai.mit.edu/people/hqm/openacs

If you just want to run your site in all LATIN2, you should get my AOLserver patches and apply them, and then make sure that in your .ini or .tcl server config file you have the lines

ns_section "ns/parameters"
        ns_param   HackContentType 1
        ns_param   URLCharset      shift_jis
        ns_param   OutputCharset   shift_jis
        ns_param   HttpOpenCharset shift_jis
URLCharset is the important one. My patches to AOLserver will look at this parameter and will use the specified encoding when reading posted form data.

You can also explcitly call the function ns_setformencoding before you call ns_getform.

Of course use
        ns_param   URLCharset      iso-8859-2
        
not shift_jis.

This command uses the MIME style name for the encoding. It is a little confusing in AOLserver because Tcl has its own slightly different names for charset encodings than HTTP/MIME does. There is a conversion table in the file nsd/conn8.c you can look it if you get confused.

Well, this doesn't seem to be an AOLServer issue. I (and as I recall, Robert too) have managed to get Tcl8.3 running with UTF/ISO mix of encodings - though having a working ADP support would be nice (I can't find a straightforward way to use 8859-2 in ADP-s. At present I use ad_parse_template, and I have thought of a filter that will do the same - return the headers with an appropriate ns_startcontent and parse the ADP file).

This brings a new question: I have noticed, that ad_parse_template used the fancy parser when invoked with -file, while the standard (adp) one while using -string?

I am downloading the bundled version of PG7.1RC2 to give it a test on my PLD-linux box (perhaps it was the issue of something broken with RH or Debian)

How did you get tcl8.3 running with 8859-2? Does your setup correctly store POSTed values in the database?

I would love to use nsd8x instead of nsd76, but have not yet seen a configuration where it does not convert special characters, e.g. german umlauts, into two weird characters, exactly as described here http://dqd.com/~mayoff/encoding-doc.html.

Converting the database to an encoding other than utf-8 would be acceptable for me too, since I only need to run my site with 8859-1 (for the german umlauts mainly).

I tried the patches from http://www.ai.mit.edu/people/hqm/openacs, but they did not work for me - the umlauts are still getting transformed into two weird characters.

Adding a Ns_Log call in the patched conn.c, in the function Ns_QueryToSet, right after the call to Ns_DecodeUrlCharset like:

  /* Use "ascii" to get a "raw" encoding (no translation) */
  decode = Ns_DecodeUrlCharset(&ds, value, "8bit");
  Ns_Log(Notice, "decoded value is: %s", value);

makes the two weird characters instead of the umlauts appear in the log file. When Ns_DecodeUrlCharset is called with ascii instead of 8bit the same result appears in the logfile. I know very little about neither C nor character encodings, so I have no idea what is going wrong here.

Hi!

I've been trying to get nsd8x with iso-8859-1 character set running. I've tried Henry Minsky's patches, read Henry's ASJ document, Rob Mayoff's encoding document, but with no sucess.

I'm including a short description of my system configuration. Could anybody with a working configuration please post a description of his configuration (dabatase encoding, how they compiled PostgreSQL, system encoding...)

  1. nsd.tcl file:
        ns_param   HackContentType   true
        ns_param   OutputCharset     iso-8859-1
        ns_param   PreferredCharsets iso-8859-1
        ns_param   URLCharset        iso-8859-1
    
  2. System encoding (as read in /usr/local/aolserver/modules/tcl/init.tcl)
        encoding system utf-8 (but i've also tried with iso8859-1)
    
  3. Database:
        Compiled with:
        $ ./configure --enable-locale --enable-multibyte=UNICODE --enable-unicode-conversion
     
        $ createdb -E UNICODE legolasdb
        $ psql -l
                List of databases
         Database  |  Owner   | Encoding
        -----------+----------+----------
         legolasdb | nsadmin  | UNICODE
    
  4. form.tcl
        set page "<form method=get action=form-2.tcl>
        <input type=text name=test>
        <input type=hidden name=_charset value=iso-8859-1>
        <input type=submit>
        </form>"
     
        ReturnHeaders "text/html; charset=iso-8859-1"
        ns_startcontent -type text/html
        ns_write $page
    
  5. form-2.tcl
        # I've tried these three:
        ns_formfieldcharset _charset
        # ns_urlcharset iso-8859-1
        # ns_setformencoding iso-8859-1
     
        set form [ns_getform]
        set test [ns_set get $form test]
     
        set db [ns_db gethandle]
        ns_db dml $db "insert into encoding (id, test)
          values (nextval('encoding_id_seq'), '$test')"
    
  6. Now, if i do a select id, test, upper(test) as upper, lower(test) as lower from encoding i get:
    id - test - upper - lower
    --------------------------
    47 áéíóú   áéíóú   ã¡ã©ã­ã³ãº
    
  7. And i thinks it's not a PostgreSQL issue, because in psql i can do:
    legolasdb=# show client_encoding;
    NOTICE:  Current client encoding is UNICODE
    SHOW VARIABLE
    legolasdb=# insert into zzz_encoding (id, test)
    legolasdb-#     values (nextval('zzz_encoding_id_seq'), 'áÃ?íÃ?ú');
    INSERT 184598 1
    legolasdb=# select id, test, upper(test), lower(test) from zzz_encoding;
    id |   test     |   upper    |   lower
    ----+------------+------------+------------
    48 | áÃ?íÃ?ú      | Ã?Ã?Ã?Ã?Ãs      | áéíóú
     
    
I'm using aolserver3_2_ad12 (with and without Henry's patches), and PostgreSQL 7.1rc4.

I'd be really grateful if someone could post his working configuration. Thanks in advance!

Sorry, I should have thought that the same problem with iso-8859-1 characters would arise in this site.

This is what i meant to write (this time using "&aacute;" expressions)

  • Now, if i do a select id, test, upper(test) as upper, lower(test) as lower from encoding i get:
    id - test - upper - lower
    --------------------------
    47 áéíóú   áéíóú    ã¡ã©ã­ã³ãº
    
  • And i thinks it's not a PostgreSQL issue, because in psql i can do:
    legolasdb=# show client_encoding;
    NOTICE:  Current client encoding is UNICODE
    SHOW VARIABLE
    legolasdb=# insert into zzz_encoding (id, test)
    legolasdb-#     values (nextval('zzz_encoding_id_seq'), 'áéíóú');
    INSERT 184598 1
    legolasdb=# select id, test, upper(test), lower(test) from zzz_encoding;
    id |   test     |   upper    |   lower
    ---+------------+------------+------------
    48 | áéíóú      | ÁÉÍÓÚ      | áéíóú 
    
So the problem is:
  • The test string comes OK from the database.
  • upper(test) doesn't returns uppercase chars.
  • lower(test) returns garbage.
  • With psql, everything works fine.

Well, I finally made it!

The confict was caused by system using ISO locale, while the db
thought it was unicode. The solution was to create the database WITH
ISO ENCODING and set the CLIENT_ENCODING variable to UNICODE.

The only thing left is to set PGCLIENTENCODING environment variable
to UNICODE and enjoy.

BTW: A question for Don - Is there a configuration parameter letting
you specify the client encoding for postgres (in the driver)?

No, there's not a driver directive to do that but it seems like a good  idea.  Is this implemented via a PSQL "SET" command?  If so, it is easy to incorporate into the driver, that's how the date format directive is implemented, for instance.
Yes, I think it may be done by using SET CLIENT ENCODING TO UNICODE after opening the connection. At the moment I am doing it in  another way - through a PGCLIENTENCODING environment variable (though I am not sure if it will work correctly all the time, so SET CLIENT ENCODING would be the preferred option to use).
Thanks, I'll try to find time to look into this before too much time goes by ...

If anyone else wants to investigate and submit a patch to me, that would be fine, too (can you tell I'm busy?)

Hi Grzegorz! I've the same problems with upper() and lower() functions as you have (had?), but with german umlauts.

I configured postgres with ./configure
--with-java
--enable-locale
--enable-unicode-conversion
--enable-multibyte

Then i created a database with
create database test with encoding='LATIN2' changed to test, did encoding unicode.

Wenn I try to insert umlauts I get the following error:
utf_to_latin: could not convert UTF-8

Did I miss a configure option?

Any help is _highly_ appreciated

No. When creating database, you should use LATIN1 encoding - LATIN2 is
Central European (for most Slavic countries), while LATIN1 is
appropriate for Germany, France etc.

so it should be:
CREATE DATABASE test WITH ENCODING='LATIN1';

in psql, you should use LATIN1 encoding. You should use the UNICODE
client encoding inside AOLServer - and that should do the trick (this
is done by setting PGCLIENTENCODING env. variable, though I asked Don
Baccus to add this function to the PG driver).

All this may not be neccessary, if using 2.2 glibc and 2.4 kernel
(though I haven't tested such configuration yet)

Hi Grzgeorz!

Of course you're right, I tested the whole thing with some different
encodings and latin2 was the last one, so it's in the posting.

But even with latin1 I can't get the upper() and lower() functions
to work properly, even in psql (with encoding set to latin1).

BTW, I've Kernel 2.4.4, glibc 2.2.