Forum OpenACS Q&A: Response to i18n woes on a very simple site

Posted by Reuven Lerner on
One of the reasons for the popularity of UTF-8 as a Unicode encoding is the fact that it's backward compatible with ASCII: any pure-ASCII document is already valid UTF-8.  One of the problems with UTF-8 is that it makes the underlying logic more difficult to implement (since a character may take one byte or several, depending on the character), and that Asian-language documents become somewhat large, since most of their characters need three bytes each.
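To make the variable width concrete, here is a small Python sketch. The sample characters and the helper name utf8_width are my own choices for illustration, and I use Python's "utf-16-be" codec as a stand-in for a fixed two-byte encoding, which is fair only for characters in the Basic Multilingual Plane.

```python
# Compare the byte width of a few characters in UTF-8 with a two-byte encoding.
# The samples are arbitrary: Latin, accented Latin, Hebrew, and a CJK ideograph.
samples = ["A", "\u00e9", "\u05d0", "\u65e5"]


def utf8_width(ch):
    """Return how many bytes a single character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


for ch in samples:
    two_byte_width = len(ch.encode("utf-16-be"))  # stand-in for UCS-2
    print(f"U+{ord(ch):04X}  utf-8: {utf8_width(ch)} byte(s)  "
          f"two-byte encoding: {two_byte_width} byte(s)")

# Backward compatibility: ASCII text encodes to the same bytes either way.
ascii_text = "plain ASCII"
assert ascii_text.encode("utf-8") == ascii_text.encode("ascii")
```

The ASCII letter takes one byte, the accented and Hebrew letters take two, and the ideograph takes three, which is where the size penalty for Asian-language text comes from.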

Another Unicode encoding, known as UCS-2, uses two bytes for every character.  The good news is that every character is a fixed width.  The bad news is that it's not backward-compatible with ASCII.

However, an ASCII document converted into UCS-2 is still readable by humans: It begins with two marker bytes (the byte-order mark), followed by the UCS-2 characters.  In big-endian byte order, each UCS-2 character consists of a null (zero-value) byte followed by its ASCII equivalent; in little-endian order the two bytes are swapped.  So "ABC" becomes "marker marker null A null B null C".
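The byte layout can be checked with a few lines of Python. This is a sketch under some assumptions: the function name to_ucs2_with_marker is hypothetical, I pick big-endian byte order so that the null byte comes first, and I rely on the "utf-16-be" codec, which produces the same bytes as UCS-2 for ASCII (and for any other character in the Basic Multilingual Plane).

```python
import codecs


def to_ucs2_with_marker(text):
    """Encode text as big-endian two-byte units, preceded by the marker bytes.

    Hypothetical helper for illustration. Only correct as "UCS-2" for
    characters in the Basic Multilingual Plane, which includes all of ASCII.
    """
    return codecs.BOM_UTF16_BE + text.encode("utf-16-be")


encoded = to_ucs2_with_marker("ABC")

# Show each byte in hex: two marker bytes, then a null before each letter.
print(" ".join(f"{b:02x}" for b in encoded))

# Dropping the marker and every null byte recovers the original ASCII text.
recovered = encoded[2:][1::2].decode("ascii")
print(recovered)
```

The hex dump reads fe ff 00 41 00 42 00 43, which is the "marker marker null A null B null C" pattern with the letters still visible to a human reading the raw bytes.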

For what it's worth, Microsoft systems seem to have standardized on UCS-2.  The open-source languages and systems that I use, such as Perl, Python, Tcl, and PostgreSQL, instead seem to have adopted UTF-8.