Forum OpenACS Q&A: i18n woes on a very simple site

Posted by Reuven Lerner on

I've built a number of i18n sites with UTF-8 on OpenACS 3.x and OpenACS 4.x. I would use the HackContentType parameters in nsd.tcl, and never had to think about it very much.

But I now have a client who wants, for various historical reasons, a static-only site, with the option of introducing dynamic items later on. So I basically have an OpenACS 4.5 installation that's using acs-templating, but no modules other than that.

The problem is that the client's graphic designer doesn't want to use Unicode. Instead, she wants to use files encoded in windows-1255, which is roughly equivalent to iso-8859-8 (i.e., Latin and Hebrew characters). I wouldn't have thought this would be a problem, but it has caused me no end of headaches:

  • I wouldn't mind letting the encoding remain unspecified, and using a meta tag at the top of each .adp page to indicate the encoding. But the Content-type header always comes with an explicit declaration of iso-8859-1.
  • If I do nothing, then the files are declared to be iso-8859-1, meaning that the Hebrew characters look like Western European accented vowels.
  • If I use HackContentType to set the encoding explicitly to iso-8859-8, then I get question marks instead of Hebrew characters.
  • The encoding-system, ns/mimetypes, and ns/encodings tricks that I've seen mentioned on the bboard don't seem to have any effect. I still get the question marks.
  • I even modified the acs-templating file-reading utility, such that it explicitly sets the encoding on the input file descriptor to iso-8859-8. No such luck.
  • I'm definitely getting question marks sent in the HTTP response. My browser is accurately reflecting what it received; the problem is not a matter of fonts or browser confusion. ngrep clearly indicates that I'm getting a long string of 0x3F characters.
  • I assume that the problem is that OpenACS templates are somehow assuming that the files on disk -- which are encoded in iso-8859-8 -- are actually encoded in UTF-8, and that this is confusing someone. But I can't figure out just who is getting confused, or why the characters are being returned to me in a garbled state.

    Any suggestions? I think that the client is beginning to think that I don't know what I'm talking about with i18n, and I'm beginning to wonder if they're right...

Posted by Tilmann Singer on
Are you sure that it's not a problem specific to the windows-1255 encoding? That is, would it fail the same way with iso-8859-8 encoded files?

What does the unix file command say - it tells you the encoding of a file:

tils@tp:~$ file some.txt 
some.txt: ISO-8859 text

Posted by Reuven Lerner on
After doing some experimentation, it seems like the problem definitely has to do with how the files are read on disk.  When I put all of the encoding directives in nsd.tcl, I find that a windows-1255 file is sent to me as question marks, whereas a UTF-8 file is sent to me just fine.

So the HTTP encoding part is working without a hitch.  It's the file-reading part that is breaking down, somehow expecting UTF-8 when I feed it windows-1255.

Maybe I just didn't modify the right part of acs-templating?  But should I have to do this?  Hmmm...

Posted by David Walker on
Experiment with my ns_binarysupport module from http://www.vorteon.com/download/

It adds 2 relevant commands, ns_returnbinary and ns_writebinary.
I use the Tcl rename command, i.e. "rename ns_return ns_return_classic" followed by "rename ns_returnbinary ns_return", and my web site is then able to return files I receive in Windows character sets without the question marks, so it might work for your issue as well.
Posted by Tilmann Singer on
Did you already try to serve an iso-8859-8 encoded adp file instead of this windows-1255 stuff? Did that work?
Posted by Reuven Lerner on
Tilmann,

Windows-1255 and ISO-8859-8 are virtually identical, differing only in a few characters.  So they're basically interchangeable.  That said, Tcl (and HTTP) do recognize the differences between the two, and support both (as cp1255 and iso8859-8 in Tcl).
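The near-overlap is easy to check from a scripting prompt. A quick sketch (shown in Python rather than Tcl, purely for illustration; the byte values are taken from the published charset tables):

```python
# Hebrew letters occupy the same byte range (0xE0-0xFA) in both charsets.
aleph = b"\xe0"   # HEBREW LETTER ALEF in both windows-1255 and iso-8859-8
assert aleph.decode("cp1255") == aleph.decode("iso8859_8") == "\u05d0"

# windows-1255 additionally assigns bytes that iso-8859-8 leaves undefined,
# e.g. 0xC0 (a Hebrew vowel point):
print(b"\xc0".decode("cp1255"))          # decodes fine under cp1255
try:
    b"\xc0".decode("iso8859_8")
except UnicodeDecodeError:
    print("0xc0 is undefined in iso-8859-8")
```

So for plain Hebrew text the two encodings really are interchangeable; only the extra Windows-assigned bytes differ.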

I'm convinced at this point that the problem is in the reading of the file on the AOLServer side.  AOLServer/OpenACS seems to assume that ADP pages are all encoded in UTF-8.  If I can somehow change this assumption, my problems will be solved.  Until then, AOLServer will continue to read every 8-bit character in my file as an illegal UTF-8 character, and thus barf on every Hebrew character in there.

Posted by Alex Sokoloff on
Reuven,

Just to be sure this detail hasn't tripped you up... Tcl and the AOLserver Tcl API use different naming conventions for charsets.

So reading in your file on disk you'd use:

fconfigure $fd -encoding "iso8859-8"

Returning the page you'd use:

ns_return 200 "text/html; charset=iso-8859-8"

(note extra hyphen in second case).

Posted by Tilmann Singer on
Ok, for some reason you refuse to test it with an iso-8859-8 file; I have to accept that ... ;-) I was just thinking that aolserver *might* have a problem with Windows-specific encodings, and that you *might* narrow down the cause of your problem by testing with iso-8859-8. Looking into the lib/tcl8.3/encoding/ directory of aolserver, I see that the windows encodings are there, so *probably* they work the same ...

Anyway, maybe you want to check if aolserver has correctly transformed the characters internally to unicode from the file by trying an adp like this:

<master>
<% set s "string with weird characters" %>
length: <%=[string length $s]%>
<p>
bytelength: <%=[string bytelength $s]%>

It should produce a higher value for bytelength. At least it does for me when I insert some german umlauts in the string.
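The same character-count-versus-byte-count check, sketched in Python for anyone following along outside AOLserver:

```python
s = "Gr\u00fc\u00dfe"                 # "Grüße": two non-ASCII characters
print(len(s))                          # 5 characters
print(len(s.encode("utf-8")))          # 7 bytes: u-umlaut and sharp-s take two each
```

If the two numbers come out equal for a string that contains Hebrew, the characters were already lost on the way in.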

Posted by Reuven Lerner on
Well, I've proven beyond a shadow of a doubt -- playing with Emacs a bit earlier today -- that the problem is the file encoding.

Basically, if you take a file in windows-1255/iso-8859-8 and read it into a program expecting to see UTF-8, all of the Hebrew characters are turned into question marks.  The Latin characters remain unchanged, as you might expect.

So I really do need some way of telling AOLServer to read in ADP and HTML files in an encoding other than UTF-8.  I'll try the solutions posted here earlier, but if someone knows of a good, simple fix, I'd be happy to hear it!
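To make the failure mode concrete, here is a sketch of the two-step mangling (in Python, for illustration only; the assumption is that the reader replaces invalid UTF-8 sequences with U+FFFD and the writer then substitutes '?' for anything it can't encode):

```python
hebrew = "\u05e9\u05dc\u05d5\u05dd"          # "shalom" in Hebrew letters
raw = hebrew.encode("cp1255")                 # the bytes as stored on disk

# Step 1: the reader assumes UTF-8; every high-bit Hebrew byte is an
# invalid sequence, so the runs are replaced with U+FFFD.
garbled = raw.decode("utf-8", errors="replace")
assert set(garbled) == {"\ufffd"}             # nothing but replacement characters

# Step 2: the writer can't represent U+FFFD in the output charset, so
# every character comes out as '?' (byte 0x3F) -- exactly what ngrep saw.
print(garbled.encode("iso8859_8", errors="replace"))
```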

Posted by Jeff Davis on
Reuven, did you read the design doc in acs-lang?  It does talk about
handling this problem although I am not entirely sure if it was
implemented.
Posted by Reuven Lerner on
Jeff, I know that I've read docs for acs-lang before, but I'm not sure what you're specifically referring to.  Would you mind providing a link?
Posted by Reuven Lerner on
Never mind -- I found it in my copy of the OpenACS CVS tree.
Posted by Jeff Davis on
For some reason the docs are not on dev.openacs.org, but you can find them in packages/acs-lang/www/doc/ in the distribution.  In particular there is a section in i18n-design.html, "VI.B Naming of Template Files To Encode Language and Character Set", which might have some useful information.  I am not entirely sure if this stuff was implemented, and you should be aware that Lars and his Collaboraid gang are currently doing a major overhaul of acs-lang, so even if it did work it might not on HEAD (then again, maybe it didn't work and now it will).
Posted by Andrew Piskorski on
Reuven, you're having problems with character encodings when reading in ADP files from disk, right? Did you look at Rob's and Henry's docs on AOLserver character set issues?

In particular, Henry gives an example rp_handle_adp_request proc, which reads in .adp files like so:

set mimetype [ns_guesstype [ad_conn file]]    ;# guess the MIME type from the file extension
set encoding [ns_encodingfortype $mimetype]   ;# map the MIME type to a Tcl encoding name
set fd [open [ad_conn file] r]
fconfigure $fd -encoding $encoding            ;# read with that encoding, not the default
set template [read $fd]
close $fd

Posted by Henry Minsky on
I've mostly forgotten the details, but I do remember that you definitely had to patch the adp handler if your files were in anything except ISO-Latin or maybe UTF-8 encoding. The AOLserver adp handler that reads files from disk has some kind of hardcoded expectation of what encoding the file is in. So instead, you need to patch ACS's adp handler to read the file into a Tcl string, using the channel encoding of your choice when you open the file, and then pass it to the adp interpreter with the -string option.

I had a similar hack for plain .tcl files: the request processor was patched to set the encoding before reading the file. These hacks let me read Japanese Shift-JIS encoded .tcl and .adp files properly.

Emacs can be coerced to leave file encodings alone, but it was
always trying to do something "helpful" if you turned your back on it.

Posted by Alex Sokoloff on
Henry said:

>> Emacs can be coerced to leave file encodings alone, but it was always trying to do something "helpful" if you turned your back on it.

Anyone know of a free tool that is good for reading/converting/writing file encodings?

Posted by Reuven Lerner on
Well, everyone on the bboard has been super-helpful.  In the end, your help wasn't needed:

I received a call from the graphic designer today.  She said that everyone was asking (impatiently) when the site would be ready.  I told her that the site is ready, but that I'm still working on getting OpenACS to handle windows-1255 encoding.  I told her (once again) that if she's willing to work in Unicode, then I can simply run "recode" over the entire system, and it'll be ready within seconds.

She said that if it's not too hard to work in Unicode (which is a question of operating system, applications, and all sorts of other fun, for those who haven't ever had to deal with end users on this issue), then she's willing to try it.  I gave her a link to the Windows version of GNU recode, transformed the site, and everything worked fine.

So thanks again to everyone here, but it seems that life is just easier in general when everyone works with Unicode -- and the time I spent investigating and playing with this dwarfs the time that she'll have to invest in learning "recode".
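For anyone who wants the recode step without installing recode, a rough equivalent is only a few lines in a scripting language. A sketch in Python; the extension list is my own choice, and since it rewrites files in place you should run it on a copy of the tree first:

```python
from pathlib import Path

def convert_tree(root, src_enc="cp1255", dst_enc="utf-8",
                 exts=(".adp", ".html")):
    """Re-encode every matching file under root, in place."""
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in exts:
            text = path.read_text(encoding=src_enc)   # decode from the old charset
            path.write_text(text, encoding=dst_enc)   # re-encode and overwrite
```

This is roughly what running recode over the whole site does in one shot.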

Posted by Michael A. Cleverly on
"Anyone know of a free tool that is good for reading/converting/writing file encodings?"

Does tclsh count? :-)

Something along the lines of:

set input [open $input_file r]
set output [open $output_file w]    ;# note: must be opened for writing
fconfigure $input -encoding $input_encoding
fconfigure $output -encoding $output_encoding
puts $output [read $input]
close $output
close $input

I've never worked with encodings, but it seems like Tcl would make a great language for writing a simple command-line utility to do this...

Posted by Reuven Lerner on
Tcl's support for different encodings is indeed excellent; you could certainly create a command-line utility with it.

But if you're simply looking for a good command-line utility, you should check out GNU recode.  It knows how to translate between any two encodings.

Posted by Alex Sokoloff on
I just installed recode last night, as a matter of fact. The documentation wasn't so fun to plow through. Using tclsh will be really straightforward... just the thing I was after. And there it was, right under my nose! ;-)
Posted by Alex Sokoloff on
While we're discussing charsets... Does anyone know which character encodings ASCII is compatible with? Is it all of them? I know, for example, you can open a file of ASCII characters as utf-8 or iso-8859-1, and all will be well. But does the same carry over to other commonly used encodings, like shift_jis or big5? Hmm, easy enough to test....
Posted by Reuven Lerner on
One of the reasons for the popularity of UTF-8 as a Unicode encoding is the fact that it's backward compatible with ASCII.  One of the problems with UTF-8 is that it makes the underlying logic more difficult to implement (since each character may be any number of bytes wide), and that Asian-language documents become somewhat larger.

Another Unicode encoding, known as UCS-2, uses two bytes for every character.  The good news is that every character is a fixed width.  The bad news is that it's not backward-compatible with ASCII.

However, an ASCII document converted into UCS-2 is still readable by humans: It begins with two marker bytes, followed by the UCS-2 characters.  Each UCS-2 character consists of a null (zero-value) byte followed by its ASCII equivalent.  So "ABC" becomes "marker marker null A null B null C".
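That byte layout is easy to verify from a scripting prompt (Python shown, just as an illustration; strictly speaking, modern systems call this UTF-16, of which UCS-2 is the fixed-width subset, and the example uses the big-endian form to match the description above):

```python
import codecs

# "ABC" in big-endian UCS-2/UTF-16: a two-byte marker (the byte-order mark),
# then a null byte before each ASCII character.
encoded = codecs.BOM_UTF16_BE + "ABC".encode("utf-16-be")
print(encoded)  # b'\xfe\xff\x00A\x00B\x00C'
```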

For what it's worth, Microsoft systems seem to have standardized on UCS-2.  The open-source languages and systems that I use, such as Perl, Python, Tcl, and PostgreSQL, instead seem to have adopted UTF-8.