Forum OpenACS Q&A: Character Set Problem with web form input
I have a character set problem reading German umlauts from web forms (using OpenACS 4 and PG 7.1 with LATIN-1 encoding). The umlauts are converted to strange characters, as I can see from log file notices (ö becomes Ã¶, for example). If written back to the browser these characters are converted back again and displayed correctly, but Ã¶ and friends are what gets written into the database. This conversion happens only with web form input; data read from text files gets successfully written to the database (as ö, ä ...). I have tried changing the [encoding system], but that did not help. The other AOLserver patches seemed to deal with writing back, not with reading from forms. Interestingly, ns_getform claims to use "iso8859-1 for charset iso-8859-1", which would be correct, so why are these strange characters generated? TIA
I have not compiled PostgreSQL myself (psql -V tells me multibyte support is there); I am currently using an RPM version. As the database's support for the German characters seems to work very well, I thought that would not be necessary?
Any umlaut read from a text file is displayed correctly at all times (in psql, in the error log written via ns_log, and on the browser side).
But if I retrieve a value via form input, garbage is immediately displayed when writing the value to the log:
set form_value [encoding convertfrom iso8859-1 [ns_set value $formdata $i]]
ns_log Notice $form_value   ;# --> garbage (no database involved)
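The extra `encoding convertfrom` may itself be the culprit: if AOLserver has already decoded the form data into a proper Tcl string, decoding it a second time produces exactly the kind of garbage described above. A minimal sketch in plain tclsh (no AOLserver required) reproduces the effect:

```tcl
# A correctly decoded Tcl string containing ö
set s "\u00f6"

# Simulate double-decoding: first take the string's UTF-8 byte form ...
set bytes [encoding convertto utf-8 $s]            ;# bytes \xC3 \xB6

# ... then misread those bytes as ISO-8859-1, as a redundant
# [encoding convertfrom iso8859-1] would do
set wrong [encoding convertfrom iso8859-1 $bytes]

puts $wrong   ;# prints the two-character garbage "Ã¶"
```

If that matches what you see in the log, try logging the ns_set value directly, without any conversion.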
tils@tp:~$ psql -l
        List of databases
 Database | Owner | Encoding
----------+-------+----------
 oacs     | tils  | UNICODE
About the weird characters in the error log: that's OK; it's probably because the terminal (or whatever you use to look at the log) does not understand Unicode. psql will also display those weird characters in an otherwise correct setup. There are some older threads here on this topic where this is explained much better.
Depending on your version of OpenACS and AOLserver you might need some of Henry Minsky's patches to make form input work correctly.
I run a recent CVS checkout of OpenACS 4 and AOLserver 3.3ad13, and that combination seems to contain all the necessary fixes already.
But make sure that your database is configured for Unicode:
/usr/local/pgsql/bin/initdb --encoding unicode -D /usr/local/pgsql/data
createdb --encoding=unicode yourdbname
AOLserver internally uses UTF-8 to talk to Postgres, so I think this is required. You might get away with ISO-8859-1 for some characters, but I bet you are getting screwed by the UTF-8 to ISO-8859-1 conversion.
Note that the AOLserver version we use is patched so that you
can pass an encoding to ns_getform, to tell it explicitly what
charset the form data was posted in, so that it can convert
properly to Unicode in AOLserver Tcl strings. Look in the
tcl/modules/form.tcl file to see how it works.
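A sketch of what a call to the patched ns_getform might look like; note this assumes the patched AOLserver described above, and the exact argument syntax is an assumption here, so check tcl/modules/form.tcl in your own tree for the real signature:

```tcl
# Assumption: the patched ns_getform accepts the charset the form was
# posted in (see tcl/modules/form.tcl for the actual interface).
set form [ns_getform iso-8859-1]
if {$form ne ""} {
    # Values are now proper Tcl (Unicode) strings; no manual
    # [encoding convertfrom] is needed afterwards.
    set name [ns_set get $form "name"]
}
```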
There's a simpler method of doing this, even without the aD version of AOLserver (+ad13): you can use the current 3.4.2 AOLserver from aolserver.com or SourceForge.
The only thing you need to do is place some undocumented params in your config.tcl:

ns_section "ns/encodings"
ns_param adp iso8859-1
ns_param tcl iso8859-1
Then it's even OK if you run "ASCII" encoding in your Postgres database; Latin-1 will also work. I don't think (as of now) you have to use the (slow) Unicode stuff with Postgres.
Is it right that part of Henry's patches needs to be ported to 3.4.2 for that to work? Any chance of that happening in the near future?
And yet more questions:
By which factor would you estimate PostgreSQL to be slower when dealing with multibyte stuff?
And would this slowdown only affect databases in a multibyte encoding, or is it already caused by compiling PostgreSQL with multibyte enabled?
For what it's worth, my solution was different from what everyone else on these boards has ever told me.
In fact, even this site (openacs.org) seems to be doing it wrong, because this particular message is being sent to me iso-8859-1 encoded. It should be encoded utf-8 so that all the characters display properly.
As a test, if you're using IE, you can go to View | Encoding and play around a bit and see if the characters don't clear up.
Anyway, what I've found is that you have to set the encoding to UTF-8, and then the characters display properly. What I've done in my nsd.tcl file is:
ns_param HackContentType 1
ns_param URLCharset "utf-8"
ns_param OutputCharset "utf-8"
ns_param HttpOpenCharset "utf-8"
I also changed several functions (ad_return_top_of_page and all the other ReturnHeaders* functions) to send:
Content-Type: $content_type; charset=utf-8
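The same header change can be sketched with AOLserver's ns_return; the helper name below is hypothetical and this is not the actual ReturnHeaders* code, it just shows the charset being appended to the Content-Type:

```tcl
# Hypothetical helper (not the real OpenACS code): return a page with
# an explicit charset in the Content-Type header.
proc return_page_utf8 {html {content_type "text/html"}} {
    ns_return 200 "$content_type; charset=utf-8" $html
}
```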
Then everything worked nicely. I suppose you have to have multibyte support compiled into PG, but based on some simple testing on my computer (I don't remember if I enabled multibyte support; my databases are of type SQL_ASCII), everything works just fine.
I hope this helps the people out there who are having troubles...
Probably not very likely that it bites you, but it's possible.
Then set the channel encoding using the Tcl 'fconfigure -encoding' command.
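For example, reading a Latin-1 text file into a proper Tcl string (the filename here is hypothetical; the sketch writes its own test file first so it is self-contained):

```tcl
# Write a small ISO-8859-1 test file (hypothetical filename)
set out [open "umlauts.txt" w]
fconfigure $out -encoding iso8859-1
puts -nonewline $out "\u00f6\u00e4"    ;# ö and ä, stored as Latin-1 bytes
close $out

# Read it back: set the channel encoding so Tcl decodes the bytes correctly
set fd [open "umlauts.txt" r]
fconfigure $fd -encoding iso8859-1
set data [read $fd]
close $fd

puts $data    ;# the umlauts survive intact
```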
The only thing you do is assume that EVERYONE in the world accessing your site uses UTF-8. If that were the case, there would never be a discussion or problem about all this encoding stuff. (Of course your DB also needs to be Unicode-trimmed, BTW.)
But the fact is: you can't be sure what comes in, and you may have to return very different encodings for your outgoing content.
Of course, almost all relevant browsers will be able to display your content with Unicode/UTF-8 encoding, so you are safe up to a point.
If you know you will never support, e.g., Eastern content, you will prefer to set up your database with the correct ISO-xy encoding, as it is simply faster for almost all SQL actions. And then again you have to deal with AOLserver encodings.
The request processor uses ns_adp_parse for the ADP files (I think), not "OPEN", and because the code is always single-byte stuff, they run into no problem.