Forum OpenACS Q&A: Unicode Characters?

Posted by Gilbert Wong on
I'm getting these weird Unicode characters (I think) appearing on
some of my pages. I'm running OpenACS 4/PostgreSQL. The character is
00c2, which is in the Latin character set. It's an A with a caret on
top.

I have a suspicion that these characters are coming from
MS-DOS/Windows files which I am uploading to the server. They were
created using MS FrontPage in English. Is there a way I can use a Tcl
regexp to delete these characters? I tried using the u00c2 in a
regexp and it doesn't find it.
Thanks.

Posted by Henry Minsky on
You are probably getting Microsoft characters, which are not
valid ASCII or ISO-8859-1, or anything. When Tcl reads the page from
disk, it assumes the bytes are 8859-1 and converts them to Unicode on
that basis. The only solution is to make sure the g**damn Microsoft
characters are removed before you load the file, or else to tell
AOLserver that the charset is CP1252 (I think that's the official
name for it).
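
To make the mismatch concrete, here is a minimal Python sketch (not Tcl, and not anything from AOLserver itself) that decodes the same bytes two ways. The sample bytes are an assumption: curly quotes of the kind a Windows editor typically writes.

```python
# Assumed sample: a word wrapped in CP1252 "smart quotes" (bytes 0x93 and 0x94).
raw = b"\x93hello\x94"

# Reading the bytes as ISO-8859-1 never fails, but 0x93/0x94 land on
# the C1 control characters U+0093/U+0094 instead of quotation marks.
as_latin1 = raw.decode("iso-8859-1")

# Reading the same bytes as CP1252 yields the intended curly quotes.
as_cp1252 = raw.decode("cp1252")

print([hex(ord(ch)) for ch in as_latin1])
print([hex(ord(ch)) for ch in as_cp1252])
```

The point of the comparison is that the damage happens at decode time: once the wrong charset has been assumed, the text already holds the wrong code points.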

There was a Perl script that MarkD had, called "demoroniser" (http://www.fourmilab.ch/webtools/demoroniser/), that would
do the substitutions. The problem is that if you try to do it in Tcl,
it's too late, unless you have set the channel encoding when you
read the file from disk. You can do this in Tcl, but for your sanity
I recommend you convert documents to ISO-8859-1 before trying to serve them from AOLserver. I.e., it is too painful to keep your documents sorted into ISO-8859-1 and CP1252; better to just convert to
a common format. I think Microsoft's charset is so incompatible that some of its characters don't even have Unicode equivalents at all.
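
The "convert to a common format first" advice could be sketched roughly as below. This is a Python stand-in for the idea, not the demoroniser script itself; the function name `cp1252_to_latin1` and the `ASCII_FALLBACKS` table are hypothetical, and the table covers only a handful of common punctuation characters.

```python
# Hypothetical fallback table: CP1252-only punctuation mapped to plain ASCII.
ASCII_FALLBACKS = {
    "\u2018": "'",    # left single quote
    "\u2019": "'",    # right single quote
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
    "\u2013": "-",    # en dash
    "\u2014": "--",   # em dash
    "\u2026": "...",  # ellipsis
    "\u2022": "*",    # bullet
}

def cp1252_to_latin1(data: bytes) -> bytes:
    """Decode CP1252 bytes and re-encode them as ISO-8859-1.

    Python's cp1252 codec leaves a few byte values (such as 0x81)
    undefined, so errors="replace" is used rather than failing.
    """
    text = data.decode("cp1252", errors="replace")
    text = text.translate(str.maketrans(ASCII_FALLBACKS))
    # Anything still outside ISO-8859-1 (e.g. the euro sign) becomes "?".
    return text.encode("iso-8859-1", errors="replace")

if __name__ == "__main__":
    print(cp1252_to_latin1(b"\x93caf\xe9\x94 \x85"))
```

Running this over files before uploading them means the server only ever sees one charset, which is the simplification recommended above.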

Posted by Gilbert Wong on
Well, after several hours of testing different Unicode sequences on AOLserver, I finally found the correct sequence: xa0 (which was nowhere near xc2). A simple regsub and they're gone now :)
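
One plausible explanation for why xa0 was the culprit while 00c2 was what showed up on the page: a non-breaking space (U+00A0) written out as UTF-8 is the two bytes C2 A0, and a viewer that assumes ISO-8859-1 renders the first byte as an A with a circumflex. This is an assumption about what happened here, not something confirmed in the thread. A short Python sketch of the effect and of the cleanup (the helper name `strip_nbsp` is hypothetical, and it substitutes a plain space rather than deleting outright):

```python
import re

# A non-breaking space encoded as UTF-8 is two bytes: 0xC2 0xA0.
nbsp_utf8 = "\u00a0".encode("utf-8")

# Misread as ISO-8859-1, those bytes become U+00C2 followed by U+00A0,
# which displays as an A with a circumflex next to an invisible space.
misread = nbsp_utf8.decode("iso-8859-1")

def strip_nbsp(text: str) -> str:
    """Replace each non-breaking space with an ordinary space."""
    return re.sub("\u00a0", " ", text)

if __name__ == "__main__":
    print(nbsp_utf8)
    print([hex(ord(ch)) for ch in misread])
    print(strip_nbsp("price:\u00a0100"))
```

Under that reading, searching for 00c2 finds nothing because the text itself only contains U+00A0; the extra character appears only once the bytes are sent and interpreted in the wrong charset.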