Forum OpenACS Q&A: Character Set Problem with web form input

I have a character set problem reading German umlauts from web forms
(using OpenACS 4 and PostgreSQL 7.1 with LATIN-1 encoding).

These German umlauts are converted to strange characters, as I can see
from the log file notices (ö becomes Ã¶, for example). If written back
to the browser these characters are converted back again and
displayed correctly. But it is Ã¶ and friends that get written
into the database.

This conversion happens only with web form input. Data read from text
files gets successfully written to the database (as ö, ä, ...).

I have tried changing the [encoding system] but that did not help.
The other AOLserver patches seemed to deal with writing-back issues,
not with reading from forms.

Interestingly, ns_getform claims to use "iso8859-1 for charset
iso-8859-1", which would be correct, but then why are these strange
characters generated?
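
For illustration, a minimal Tcl sketch of what I suspect is happening: the UTF-8 bytes of an umlaut being re-read as iso8859-1 (the variable names are mine):

<pre>
# hypothetical reproduction of the garbling, not taken from the actual code
set original "\u00f6"                                ;# ö
set bytes [encoding convertto utf-8 $original]       ;# the two bytes 0xC3 0xB6
set garbled [encoding convertfrom iso8859-1 $bytes]  ;# displays as "Ã¶"
ns_log Notice "original=$original garbled=$garbled"
</pre>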

TIA

Posted by Peter Alberer on
Oops, I should have used the HTML entities for the German umlauts :(
Posted by Don Baccus on
Have you compiled PG with multibyte and locale support?
Posted by Peter Alberer on

I have not compiled PostgreSQL myself (psql -V tells me multibyte support is there); I am currently using an RPM version. As the database's support for the German characters seems to work very well, I thought that would not be necessary?

Any umlaut read from a text file is displayed correctly at all times (in psql, in the error log written via ns_log, on the browser side).

But if I retrieve a value via form input, garbage immediately shows up when I write the value to the log:

<pre>
set form_value [encoding convertfrom iso8859-1 [ns_set value $formdata $i]]
ns_log Notice $form_value   ;# --> garbage (no database involved)
</pre>
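
A guess at where to look, as a minimal Tcl sketch (names as in the snippet above): logging the value both before and after the extra convertfrom should show at which step the corruption appears.

<pre>
# hypothetical diagnostic: compare the value as-is with the re-decoded one
set raw [ns_set value $formdata $i]
ns_log Notice "as-is: $raw"
ns_log Notice "re-decoded: [encoding convertfrom iso8859-1 $raw]"
</pre>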

Posted by Tilmann Singer on
You should also make sure that your OpenACS database uses the UNICODE encoding. Running psql -l should tell you something like this:

<pre>
tils@tp:~$ psql -l
         List of databases
 Database | Owner | Encoding
----------+-------+----------
 oacs     | tils  | UNICODE

...
</pre>

About the weird characters in the error log: that's OK; probably the terminal or whatever you use to look at the log does not understand Unicode. psql will also display those weird characters in an otherwise correct setup. There are some older threads here on this topic where this is explained much better.

Depending on your version of OpenACS and AOLserver you might need some of Henry Minsky's patches to make form input work correctly.

I run a recent CVS checkout of OpenACS 4 and AOLserver 3.3ad13, and that combination seems to contain all the necessary fixes already.

Posted by Henry Minsky on
If you're using ISO-8859-1, then I think things will work by default,
but make sure that your database is configured for Unicode.

/usr/local/pgsql/bin/initdb  --encoding unicode  -D /usr/local/pgsql/data

createdb --encoding=unicode yourdbname

AOLserver internally uses UTF8 to talk to postgres, so I think
this is required. You might get away with ISO-8859-1 for some
characters, but I bet you are getting screwed by the UTF-8 to ISO-8859-1 conversion.

Note that the AOLserver version we use is patched so that you
can pass an encoding to ns_getform, to tell it explicitly what
charset the form data was posted in, so that it can convert
properly to Unicode in AOLserver Tcl strings. Look in the
tcl/modules/form.tcl file to see how it works.
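
A hypothetical call under that patch; the exact argument form is an assumption on my part (check form.tcl for the real signature), and "myfield" is just an example name:

<pre>
# assumption: the patched ns_getform accepts an optional charset argument
set form [ns_getform iso-8859-1]
if { $form != "" } {
    set value [ns_set value $form myfield]   ;# already decoded to Unicode
}
</pre>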

Posted by Peter Breugel on

There's a simpler way to do this, even without the aD version of AOLserver (+ad13). You can use the current AOLserver 3.4.2 from aolserver.com or SourceForge.

The only thing you need to do is place some undocumented params in your config.tcl:

ns_section "ns/encodings"
ns_param adp iso8859-1
ns_param tcl iso8859-1

Then it's even OK if you run "ASCII" encoding in your Postgres database; Latin-1 will work too. I don't think (as of now) you have to use the (slow) Unicode stuff with Postgres.

Posted by Tilmann Singer on
Although the POST-data/database interaction seems to work fine with 3.4.2 and those configuration settings, AOLserver is not able to pick up iso8859-1 encoded files in that setup (unless I am missing something). I created a .tcl file with some German umlauts in it, and they were displayed as garbled two-character combinations in the browser. Likewise, a string defined with umlauts in the .tcl file and then saved to the database will incorrectly produce the two-byte sequence.

Is it right that part of Henry's patches needs to be ported to 3.4.2 for that to work? Any chance of that happening in the near future?

And yet more questions:

By what factor would you estimate PostgreSQL to be slower when dealing with multibyte data?

And would this slowdown affect only databases in a multibyte encoding, or is it already caused by compiling PostgreSQL with multibyte enabled?

Posted by Paul Doerwald on

For what it's worth, my solution was different from anything anyone else on these boards has told me.

In fact, even this site (openacs.org) seems to be doing it wrong, because this particular message is being sent to me iso-8859-1 encoded. It should be encoded as utf-8 so that all the characters display properly.

As a test, if you're using IE, you can go to View | Encoding and play around a bit and see if the characters don't clear up.

Anyway, what I've found is that you have to set the encoding to UTF-8, and then the characters display properly.

What I've done in my nsd.tcl file is:

ns_param        HackContentType 1
ns_param        URLCharset      "utf-8"
ns_param        OutputCharset   "utf-8"
ns_param        HttpOpenCharset "utf-8"

I also changed several functions in ad-utilities.tcl.preload, particularly ReturnHeaders, ad_return_top_of_page, and all the other ReturnHeaders* functions, to emit:

Content-Type: $content_type; charset=utf-8
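
For illustration, a sketch of the kind of change described, assuming ReturnHeaders writes the response headers itself (the actual proc in ad-utilities.tcl.preload may look different):

<pre>
proc ReturnHeaders {{content_type text/html}} {
    # the charset=utf-8 addition tells the browser how to decode the body
    ns_write "HTTP/1.0 200 OK
MIME-Version: 1.0
Content-Type: $content_type; charset=utf-8\n"
}
</pre>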

Then everything worked nicely. I suppose you have to have multibyte support compiled into PG, but based on some simple testing on my computer (I don't remember whether I enabled multibyte support; my databases are of type SQL_ASCII), everything works just fine.

I hope this helps the people out there who are having troubles...

Posted by Tilmann Singer on
But this way you are storing utf-8 encoded strings in the database without the database knowing about it. This could potentially lead to errors, e.g. when some code depends on the string length or uses PostgreSQL's substring function.

Probably not very likely that it bites you, but possible.
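
For instance (assuming a SQL_ASCII database that received UTF-8 bytes), the server counts bytes rather than characters:

<pre>
oacs=> select length('ö');  -- the ö arrives as the two bytes 0xC3 0xB6
 length
--------
      2
(1 row)
</pre>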

Posted by Henry Minsky on
To read a file in a specific encoding, you can always use "open" and
then set the channel encoding using the Tcl 'fconfigure -encoding' command.
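
Something like this (the file name is just an example):

<pre>
set fd [open /web/mysite/umlauts.txt r]
fconfigure $fd -encoding iso8859-1   ;# decode Latin-1 bytes into Tcl strings
set contents [read $fd]
close $fd
</pre>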
Posted by Jens Strupp on
Hi Paul,

the only thing you're doing is assuming that EVERYONE in the world
accessing your site uses UTF-8. If that were the case, there would
never be any discussion or problem about all this encoding stuff. (Of
course, your DB also needs to be set up for Unicode, by the way.)

But the fact is: you can't be sure what comes in, and you may have to
return very different encodings for your outgoing content.

Of course, almost all relevant browsers will be able to display your
content with Unicode/UTF-8 encoding, so you are safe up to a point.

If you know you will never support, e.g., Eastern content, you will
prefer to set up your database with the correct ISO-xy encoding, as it
is simply faster for almost all SQL operations. But then again you
have to deal with AOLserver encodings.

Henry:
The request processor uses ns_adp_parse for the ADP files (I think),
not "open", and as long as the code is all single-byte stuff, it runs
into no problems.