Forum OpenACS Development: PostgreSQL upper() / lower() functions and LATIN2 charset
I compiled PostgreSQL 7.1 RC1 with --enable-multibyte and --enable-unicode-conversion.
Then initdb -E UNICODE -D /usr/local/pgsql/data
Next I created database pl_test (createdb pl_test -E LATIN2).
When I try SELECT UPPER('some_text_with_polish_national_chars'); the Polish chars are still in lower case, but the others (abcdef...wxyz) are in upper case.
If I compile PostgreSQL only with --enable-locale (locale = pl_PL), everything works OK. But if I want to use OpenACS, I MUST store everything in unicode.
Does anybody know a solution to this problem?
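For what it's worth, the symptom matches ASCII-only case mapping: on a multibyte database, upper() appears to touch only the bytes a-z. A Python sketch of that behaviour (the word 'żaba' is just an illustration):

```python
# bytes.upper() maps only ASCII a-z and leaves all other bytes alone --
# the same behaviour the post describes for upper() on a UNICODE database.
text = 'żaba'                    # Polish word; 'ż' is outside ASCII
utf8 = text.encode('utf-8')      # b'\xc5\xbcaba' -- 'ż' becomes two bytes
print(utf8.upper().decode('utf-8'))  # żABA: ASCII letters uppercased, 'ż' untouched
```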
You should report this to the PostgreSQL development group. Though it is too late to get a bug fix into PG 7.1.0, PG 7.1.1
is already being discussed and I suspect you'd see a bugfix available
via CVS in relatively short order.
I'm just learning about how encodings are handled in
Postgres, but this document contains some useful information. In particular,
it looks like you may need to make sure your interactive psql client encoding
is set to your charset if you want to input 8-bit characters:
SET CLIENT_ENCODING TO 'encoding';
To query the current frontend encoding:

SHOW CLIENT_ENCODING;
I have already posted this problem to the PostgreSQL mailing list. No response, though :( And no improvement :(
While using the LATIN2-driven database, I have had no problems - the
problems start when using UTF-8. Even after issuing the LATIN2
encoding in psql I can't make UPPER() work. What's worse, it spits out
notices saying that the backend can't convert some characters.
Basically, it is not the 'set client encoding' thing's fault.
Something is obviously wrong (and I have tried every possible
combination of compilation/initdb/runtime parameters).
The PostgreSQL developers have been very busy recently trying to get RC1 out the door, so you might
not get an immediate response. If all else fails, I recommend you
e-mail Tom Lane directly.
Thanks for your responses, but I think I should explain my problem again.
I want to use (incoming) OpenACS 4.x with LATIN2 encoding.
In my opinion, I should set database encoding to UNICODE.
So, I have to compile PostgreSQL 7.1 with --enable-multibyte and
--enable-unicode-conversion (and, maybe, with --enable-locale).
Next initdb -E UNICODE -D /usr/local/pgsql/data .
Am I right?
As I wrote in my first post, upper() and lower() work improperly with these settings - and not only upper() and lower(): ORDER BY misbehaves too.
I test everything in this way:
- First, I create a table:
CREATE TABLE names ( name varchar(20) );
- Next, I insert a few names via the browser (encoded in ISO8859-2) into a Tcl script;
- The Tcl script encodes the names to UTF-8 (UNICODE) and sends them to PostgreSQL.
- Then, PostgreSQL sends results of query
SELECT UPPER(name) AS fn FROM names ORDER BY fn;
(encoded in UNICODE) to the Tcl script.
- The script encodes the results to ISO8859-2 and sends them to the browser.
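The round trip above can be sketched in Python (standing in for the Tcl layer; the name is just an example):

```python
# Browser submits form data as ISO-8859-2 bytes.
form_bytes = 'Łukasz'.encode('iso8859-2')

# The Tcl layer decodes ISO-8859-2 and re-encodes to UTF-8 for PostgreSQL.
name = form_bytes.decode('iso8859-2')
utf8_for_db = name.encode('utf-8')

# Query results come back as UTF-8 and are re-encoded for the browser.
from_db = utf8_for_db.decode('utf-8')
back_to_browser = from_db.encode('iso8859-2')

assert back_to_browser == form_bytes   # the round trip itself is lossless
```

The conversions themselves are lossless, which is consistent with a plain SELECT name FROM names; coming back intact; only the in-database upper()/ORDER BY steps misbehave.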
With this query the results are wrong, but when I try SELECT name FROM names;, the results are OK.
It looks like PostgreSQL doesn't know how to sort the query results, so it uses some default sorting method (based on the order the records were inserted?).
I'm not sure it's a PostgreSQL bug. Maybe it's a problem with my locale settings (LC_ALL=pl_PL, LC_COLLATE=pl_PL, LC_CTYPE=pl_PL, LANG=pl_PL) or with the C library (glibc 2.1).
I have Debian GNU/Linux 2.2 "Potato" - Intel Celeron - kernel 2.2.19
PostgreSQL compiled with gcc 2.95.2
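The ORDER BY symptom is consistent with a plain byte-wise (C locale) comparison of the UTF-8 strings rather than insertion order; a sketch with made-up names:

```python
# In UTF-8, Polish letters encode to bytes above 0x7F, so a byte-wise
# (C locale) sort places them after every ASCII letter.
names = ['Ćwik', 'Zieliński', 'Adamski']
byte_order = sorted(names, key=lambda s: s.encode('utf-8'))
print(byte_order)   # ['Adamski', 'Zieliński', 'Ćwik'] -- 'Ćwik' sorts last
# A pl_PL collation would instead give: Adamski, Ćwik, Zieliński.
```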
Sorry, we can't help you here. We don't fix Postgres bugs.
You need to go talk to the postgres developers about this bug. With
luck it will get fixed by PG 7.1.1 with a patch available earlier.
You have to be careful about what it means to submit LATIN-2 data from the browser to your AOLserver web server.
AOLserver *needs* to be told what character set your form data is in so that it can convert it to Unicode properly.
Your browser will *not* supply this information automatically. You have to set AOLserver to expect LATIN2 in the form data. There are about five different ways to do this. See my notes at http://imode.arsdigita.com/i18n or http://www.ai.mit.edu/people/hqm/openacs
If you just want to run your site in all LATIN2, you should get my AOLserver patches and apply them, and then make sure that in your .ini or .tcl server config file you have the lines
ns_section "ns/parameters"
ns_param HackContentType 1
ns_param URLCharset shift_jis
ns_param OutputCharset shift_jis
ns_param HttpOpenCharset shift_jis

URLCharset is the important one. My patches to AOLserver will look at this parameter and will use the specified encoding when reading posted form data.
You can also explicitly call the function ns_setformencoding before you call ns_getform.
That should of course be

ns_param URLCharset iso-8859-2

not shift_jis.
This command uses the MIME-style name for the encoding. It is a little confusing in AOLserver because Tcl has its own slightly different names for charset encodings than HTTP/MIME does. There is a conversion table in the file nsd/conn8.c you can look at if you get confused.
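To illustrate the naming mismatch, here is a hypothetical sketch of such a mapping (the real table lives in nsd/conn8.c; these few pairs are just examples):

```python
# Hypothetical MIME-name -> Tcl-encoding-name mapping, for illustration only.
MIME_TO_TCL = {
    'iso-8859-1': 'iso8859-1',
    'iso-8859-2': 'iso8859-2',
    'shift_jis':  'shiftjis',
    'utf-8':      'utf-8',
}

def tcl_encoding(mime_name):
    # Fall back to dropping the first hyphen, which is how most
    # Tcl names differ from their MIME counterparts.
    key = mime_name.lower()
    return MIME_TO_TCL.get(key, key.replace('-', '', 1))

print(tcl_encoding('ISO-8859-2'))  # iso8859-2
```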
This brings up a new question: I have noticed that ad_parse_template uses the fancy parser when invoked with -file, but the standard (adp) one when invoked with -string?
I am downloading the bundled version of PG 7.1 RC2 to give it a test on my PLD-linux box (perhaps the issue is something broken in RH or Debian).
I would love to use nsd8x instead of nsd76, but have not yet seen a configuration where it does not convert special characters, e.g. German umlauts, into two weird characters, exactly as described here http://dqd.com/~mayoff/encoding-doc.html.
Converting the database to an encoding other than utf-8 would be acceptable for me too, since I only need to run my site with 8859-1 (for the german umlauts mainly).
I tried the patches from http://www.ai.mit.edu/people/hqm/openacs, but they did not work for me - the umlauts are still getting transformed into two weird characters.
Adding a Ns_Log call in the patched conn.c, in the function Ns_QueryToSet, right after the call to Ns_DecodeUrlCharset like:
/* Use "ascii" to get a "raw" encoding (no translation) */
decode = Ns_DecodeUrlCharset(&ds, value, "8bit");
Ns_Log(Notice, "decoded value is: %s", value);
makes the two weird characters instead of the umlauts appear in the log
file. When Ns_DecodeUrlCharset is called with
"8bit", the same result appears in the logfile. I know very
little about either C or character encodings, so I have no idea what
is going wrong here.
I've been trying to get nsd8x with the iso-8859-1 character set running. I've tried Henry Minsky's patches, read Henry's ASJ document and Rob Mayoff's encoding document, but with no success.
I'm including a short description of my system configuration. Could anybody with a working configuration please post a description of it (database encoding, how they compiled PostgreSQL, system encoding...)?
- nsd.tcl file:
ns_param HackContentType true
ns_param OutputCharset iso-8859-1
ns_param PreferredCharsets iso-8859-1
ns_param URLCharset iso-8859-1
- System encoding (as read in /usr/local/aolserver/modules/tcl/init.tcl)
encoding system utf-8 (but I've also tried with iso8859-1)
Compiled with:
$ ./configure --enable-locale --enable-multibyte=UNICODE --enable-unicode-conversion
$ createdb -E UNICODE legolasdb
$ psql -l
      List of databases
 Database  |  Owner  | Encoding
-----------+---------+----------
 legolasdb | nsadmin | UNICODE
set page "<form method=get action=form-2.tcl>
<input type=text name=test>
<input type=hidden name=_charset value=iso-8859-1>
<input type=submit>
</form>"
ReturnHeaders "text/html; charset=iso-8859-1"
ns_startcontent -type text/html
ns_write $page
# I've tried these three:
ns_formfieldcharset _charset
# ns_urlcharset iso-8859-1
# ns_setformencoding iso-8859-1
set form [ns_getform]
set test [ns_set get $form test]
set db [ns_db gethandle]
ns_db dml $db "insert into encoding (id, test) values (nextval('encoding_id_seq'), '$test')"
- Now, if I do a

select id, test, upper(test) as upper, lower(test) as lower from encoding;

I get:

 id | test  | upper |   lower
----+-------+-------+------------
 47 | áéíóú | áéíóú | ã¡ã©ãã³ãº
- And I think it's not a PostgreSQL issue, because in psql I can do:

legolasdb=# show client_encoding;
NOTICE:  Current client encoding is UNICODE
SHOW VARIABLE
legolasdb=# insert into zzz_encoding (id, test)
legolasdb-# values (nextval('zzz_encoding_id_seq'), 'áÃ?íÃ?ú');
INSERT 184598 1
legolasdb=# select id, test, upper(test), lower(test) from zzz_encoding;
 id |   test   |   upper    |  lower
----+----------+------------+---------
 48 | áÃ?íÃ?ú | Ã?Ã?Ã?Ã?Ãs | áéíóú
aolserver3_2_ad12(with and without Henry's patches), and
I'd be really grateful if someone could post his working configuration. Thanks in advance!
This is what i meant to write (this time using "á" expressions)
So the problem is:
- Now, if I do a

select id, test, upper(test) as upper, lower(test) as lower from encoding;

I get:

 id | test  | upper |   lower
----+-------+-------+------------
 47 | áéíóú | áéíóú | ã¡ã©ãã³ãº
- And I think it's not a PostgreSQL issue, because in psql I can do:

legolasdb=# show client_encoding;
NOTICE:  Current client encoding is UNICODE
SHOW VARIABLE
legolasdb=# insert into zzz_encoding (id, test)
legolasdb-# values (nextval('zzz_encoding_id_seq'), 'áéíóú');
INSERT 184598 1
legolasdb=# select id, test, upper(test), lower(test) from zzz_encoding;
 id | test  | upper | lower
----+-------+-------+-------
 48 | áéíóú | ÁÉÍÓÚ | áéíóú
- The test string comes back OK from the database.
- upper(test) doesn't return uppercase chars.
- lower(test) returns garbage.
- With psql, everything works fine.
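Those symptoms are exactly what you get when a single-byte (Latin-1-style) case map is applied to UTF-8 bytes; a Python sketch reproducing them (the test string is from the post above):

```python
# Store 'áéíóú' as UTF-8, then apply single-byte (Latin-1) case mapping
# to the raw bytes -- mimicking upper()/lower() under a mismatched encoding.
s = 'áéíóú'
as_latin1 = s.encode('utf-8').decode('latin-1')   # 'Ã¡Ã©Ã­Ã³Ãº'

# upper(): 'Ã' is already uppercase and '¡', '©', ... have no case,
# so the bytes come back unchanged -- the string *looks* un-uppercased.
print(as_latin1.upper() == as_latin1)  # True

# lower(): 'Ã' (0xC3) maps to 'ã' (0xE3), producing the 'ã¡ã©...' garbage.
print(as_latin1.lower())
```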
The conflict was caused by the system using an ISO locale, while the db
thought it was unicode. The solution was to create the database WITH
ISO ENCODING and set the CLIENT_ENCODING variable to UNICODE.
The only thing left is to set the PGCLIENTENCODING environment variable
to UNICODE and enjoy.
BTW: A question for Don - Is there a configuration parameter letting
you specify the client encoding for postgres (in the driver)?
If anyone else wants to investigate and submit a patch to me, that would be fine, too (can you tell I'm busy?)
I configured postgres with
Then I created a database with

create database test with encoding='LATIN2'

changed to test, and set the encoding to unicode.
When I try to insert umlauts I get the following error:
utf_to_latin: could not convert UTF-8
Did I miss a configure option?
Any help is _highly_ appreciated
LATIN2 is the charset for Central European languages (for most Slavic countries), while LATIN1 is
appropriate for Germany, France etc.
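The distinction is easy to check: German umlauts exist in both charsets, while Polish letters exist only in LATIN2. A quick Python check (using 'ä' and 'ż' as examples):

```python
# German umlauts encode in both LATIN1 (iso8859-1) and LATIN2 (iso8859-2)...
assert 'ä'.encode('iso8859-1') == b'\xe4'
assert 'ä'.encode('iso8859-2') == b'\xe4'

# ...but Polish letters like 'ż' exist only in LATIN2.
assert 'ż'.encode('iso8859-2') == b'\xbf'
try:
    'ż'.encode('iso8859-1')
except UnicodeEncodeError:
    print("'ż' is not representable in LATIN1")
```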
so it should be:
CREATE DATABASE test WITH ENCODING='LATIN1';
in psql, you should use LATIN1 encoding. You should use the UNICODE
client encoding inside AOLServer - and that should do the trick (this
is done by setting PGCLIENTENCODING env. variable, though I asked Don
Baccus to add this function to the PG driver).
All this may not be necessary if you are using glibc 2.2 and a 2.4 kernel
(though I haven't tested such a configuration yet).
Of course you're right - I tested the whole thing with several different
encodings, and latin2 happened to be the last one, so that's what ended up in the posting.
But even with latin1 I can't get the upper() and lower() functions
to work properly, even in psql (with encoding set to latin1).
BTW, I've Kernel 2.4.4, glibc 2.2.