Forum OpenACS Development: About Charsets

Posted by Eduardo Pérez on
What's the official charset of OpenACS? (UTF-8?)
I've seen that some message catalogs are using ISO-8859-* but others UTF-8.

I'm asking this because I've seen problems like:
- OpenACS sends UTF-8 without specifying a character encoding, and browsers defaulting to ISO-8859-1 show garbage for non-ASCII characters.
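That kind of garbage is easy to reproduce outside the browser. A minimal sketch, assuming GNU iconv is available: reinterpret UTF-8 bytes as ISO-8859-1, which is effectively what a browser defaulting to ISO-8859-1 does:

```shell
# UTF-8 encodes "é" as the two bytes 0xC3 0xA9 (octal \303\251). A browser
# assuming ISO-8859-1 decodes them as two separate characters, "Ã©".
# Simulate that by declaring the input ISO-8859-1 and re-encoding to UTF-8
# so the result displays on a UTF-8 terminal.
printf 'P\303\251rez' | iconv -f ISO-8859-1 -t UTF-8
# prints "PÃ©rez" instead of "Pérez"
```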

What's the point in having both forums.fi_FI.ISO-8859-15.xml and forums.fi_FI.utf-8.xml in CVS?

Are there any plans to make everything UTF-8?

2: Re: About Charsets (response to 1)
Posted by Tilmann Singer on
There are different charsets involved, but the one recommended, 'official' charset for storing data in the database is in fact Unicode. I guess that's been a requirement since i18n was added to OpenACS 5.0.

I think the option to store catalog files in an encoding other than utf-8 exists because of text editor support - most editors don't yet let the user save a file in utf-8.

Why there would be two different encodings for the same locale in CVS I don't know - maybe an oversight. Peter?

If you are using AOLserver 3.3ad13, you can control its output charset by setting these parameters in your config file:

ns_section "ns/parameters"
ns_param HackContentType 1
ns_param URLCharset iso-8859-1
ns_param OutputCharset iso-8859-1
ns_param HttpOpenCharset iso8859-1
ns_param DefaultCharset iso-8859-1

But this limits you to one charset for the whole site, hmm. I wonder what the recommended mechanism is for output in a user-specific charset. Again - Peter? ;)

3: Re: About Charsets (response to 1)
Posted by Peter Marklund on
Eduardo,
thanks for your posting!

The quick answer is that the idea is to send utf-8 from multilingual servers (this is what the translation server does). We set the charset in an HTTP header and Mozilla, IE, and Opera seem to understand this fine. We should probably set the charset in the HTML code as well (don't remember the syntax right now).
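For reference, the in-page syntax is presumably the http-equiv meta tag (shown here for utf-8; the value should match whatever charset the server actually sends):

```html
<head>
  <!-- declares the page encoding even when no HTTP header is available,
       e.g. after the file has been saved to disk -->
  <meta http-equiv="content-type" content="text/html; charset=utf-8">
</head>
```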

All locales that are not represented in ISO-8859-1 are exported to utf-8 catalog files. The Finnish ISO-8859-15 catalog file should be removed, and I have just done so. For the record, here are the commands I used:

cd <packages_dir>
find . -maxdepth 3 -type f -regex '.*catalog/.*\.xml' | grep -v 'ISO-8859-1.xml' | grep -v 'utf-8.xml' > /tmp/remove-files.txt
cat /tmp/remove-files.txt |xargs rm
cat /tmp/remove-files.txt |xargs cvs remove
cat /tmp/remove-files.txt |xargs cvs commit -m "removing obsolete catalog files"

4: Re: About Charsets (response to 1)
Posted by Eduardo Pérez on
<blockquote> The quick answer is that the idea is to send utf-8 from multilingual servers (this is what the translation server does). We set the charset in an HTTP header and Mozilla, IE, and Opera seem to understand this fine. We should probably set the charset in the HTML code as well (don't remember the syntax right now).
</blockquote>

Setting the charset inside the HTML itself is a good idea for anyone who downloads the file and wants to open it later: I think most browsers don't insert the HTTP charset into the HTML when saving a page that lacks it.

<blockquote> All locales that are not represented with ISO-8859-1 are exported to utf-8 catalog files.
</blockquote>

Why?
Why not keep all the catalog files in UTF-8, as (for example) the GNOME project does with its po files?

5: Re: About Charsets (response to 4)
Posted by Jeff Davis on
> Why not having all the catalog files in UTF-8 as (for example) the GNOME project does (with the po files)?

The problem is that if people edit the file, unless they have their local editor set to utf-8 it will mess up the file if they insert any high-bit characters. At a guess I would say the majority of developers are running with their editor's charset set to iso-8859-1 or iso-8859-15.

The reason this came up is that it turns out Tcl has (or maybe had?) a problem with iso-8859-6 (Arabic) where the digits are mapped to the Unicode Arabic-Indic digit code points, so the round trip iso-8859-6 -> utf-8 -> iso-8859-6 was not idempotent. The simple solution was to store the Arabic catalogs in utf-8, which is how we ended up where we are.
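The Tcl bug itself isn't reproducible here, but a round trip like that is easy to check from the shell. A minimal sketch, assuming GNU iconv (the sample content and file names are made up); run it over real Arabic catalog text to see whether the conversion is lossless:

```shell
# Convert a sample through ISO-8859-6 -> UTF-8 -> ISO-8859-6 and compare
# with the original; identical bytes mean the round trip is lossless.
printf 'abc 0123456789' > /tmp/sample.orig
iconv -f ISO-8859-6 -t UTF-8 /tmp/sample.orig \
    | iconv -f UTF-8 -t ISO-8859-6 > /tmp/sample.roundtrip
cmp -s /tmp/sample.orig /tmp/sample.roundtrip && echo "round trip OK"
```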

Maybe it would be better to store everything in utf-8, but if we do, I expect we will end up with some messed-up catalog files at some point (although this is true no matter what encoding we choose).

Eduardo, when you edit a file, is your editor in utf-8 or iso-8859-1?

6: Re: About Charsets (response to 1)
Posted by Eduardo Pérez on
<blockquote> The problem is that if people edit the file, unless they have
their local editor set to utf-8 it will mess up
the file if they insert any high bit characters. At a guess I
would say the majority of developers are running with their
charset in their editor as iso-8859-1 or iso-8859-15.
</blockquote>

I know that people editing the files sometimes mess up the charset; I've seen it happen in many projects.

<blockquote> Maybe it would be better to store everything in utf-8 but
if we do, I expect we will end up with some messed up
catalog files at some point (although this is true no
matter what encoding we chose).
</blockquote>

This can happen just as with any other bug; people make errors.

<blockquote> Eduardo, when you edit a file, is your editor in utf-8 or iso-8859-1?
</blockquote>

Right now, most of my files are encoded in iso-8859-1 and that's my default charset, but I can easily edit utf-8 files with my default editor (gvim --cmd 'set encoding=utf-8')
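For what it's worth, you can also check which encoding a file actually uses from the shell, assuming GNU file(1) is available (the catalog file name is the one mentioned earlier in the thread):

```shell
# "file -bi" prints the MIME type and the detected charset,
# e.g. "text/xml; charset=utf-8" for a UTF-8 catalog file.
file -bi forums.fi_FI.utf-8.xml
```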