Forum OpenACS Q&A: i18n woes on a very simple site
I've built a number of i18n sites with UTF-8 on OpenACS 3.x and OpenACS 4.x. I would use the HackContentType parameters in nsd.tcl, and never had to think about it very much.
But I now have a client who wants, for various historical reasons, a static-only site, with the option of introducing dynamic items later on. So I basically have an OpenACS 4.5 installation that's using acs-templating, but no modules other than that.
The problem is that the client's graphic designer doesn't want to use Unicode. Instead, she wants to use files encoded in windows-1255, which is roughly equivalent to iso-8859-8 (aka English and Hebrew). I wouldn't think that this is a problem, but it has caused me no end of headaches:
- I wouldn't mind leaving the encoding unspecified in the headers, and using a meta tag at the top of each .adp page to indicate the encoding. But the Content-Type header always comes with an explicit declaration of iso-8859-1.
- If I do nothing, then the files are declared to be iso-8859-1, meaning that the Hebrew characters look like Western European accented vowels.
- If I use HackContentType to set the encoding explicitly to iso-8859-8, then I get question marks instead of Hebrew characters.
- The encoding system and ns/mimetypes and ns/encodings tricks that I've seen mentioned on the bboard don't seem to have any effect. I still get the question marks.
- I even modified the acs-templating file-reading utility, such that it explicitly sets the encoding on the input file descriptor to iso-8859-8. No such luck.
- I'm definitely getting question marks sent in the HTTP response. My browser is accurately reflecting what it received; the problem is not a matter of fonts or browser confusion. ngrep clearly indicates that I'm getting a long string of 0x3F characters.
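For reference, the ns/mimetypes and ns/encodings settings I tried look something like this in the nsd.tcl config (a sketch only; the section and parameter names follow AOLserver's config conventions, and the specific mappings are my reading of how it is supposed to work):

```tcl
# nsd.tcl sketch -- map .adp/.html to a Content-Type naming the charset,
# and map that HTTP charset name to Tcl's encoding name.
ns_section "ns/mimetypes"
ns_param ".adp"  "text/html; charset=iso-8859-8"
ns_param ".html" "text/html; charset=iso-8859-8"

ns_section "ns/encodings"
ns_param "iso-8859-8" "iso8859-8"
```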
I assume that the problem is that OpenACS templates are somehow assuming that the files on disk -- which are encoded in iso-8859-8 -- are actually encoded in UTF-8, and that this is confusing someone. But I can't figure out just who is getting confused, or why the characters are being returned to me in a garbled state.
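The 0x3F bytes themselves are consistent with Tcl's substitution behavior: when a character can't be represented in a target encoding, Tcl 8.x silently replaces it with a question mark rather than raising an error. A quick tclsh illustration (aleph, U+05D0, standing in for any Hebrew character):

```tcl
set aleph "\u05D0"   ;# Hebrew aleph

# iso-8859-1 has no Hebrew, so Tcl substitutes "?" (0x3F):
puts [encoding convertto iso8859-1 $aleph]   ;# ?

# iso-8859-8 does contain it, as the single byte 0xE0:
binary scan [encoding convertto iso8859-8 $aleph] H* hex
puts $hex                                    ;# e0
```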
Any suggestions? I think that the client is beginning to think that I don't know what I'm talking about with i18n, and I'm beginning to wonder if they're right...
What does the unix file command say? It tells you the encoding of a file:

tils@tp:~$ file some.txt
some.txt: ISO-8859 text
So the HTTP encoding part is working without a hitch. It's the file-reading part that is breaking down, somehow expecting UTF-8 when I feed it windows-1255.
Maybe I just didn't modify the right part of acs-templating? But should I have to do this? Hmmm...
It adds two relevant commands, ns_returnbinary and ns_writebinary.

I use the Tcl rename command, i.e. "rename ns_return ns_return_classic", then "rename ns_returnbinary ns_return", and my web site is able to return files I receive in Windows character sets without the question marks, so it might work for your issue as well.
Windows-1255 and ISO-8859-8 are virtually identical, differing only in a few characters, so they're basically interchangeable. That said, Tcl (and HTTP) do recognize the difference between the two and support both (as cp1255 and iso8859-8, in Tcl's naming).
I'm convinced at this point that the problem is in the reading of the file on the AOLServer side. AOLServer/OpenACS seems to assume that ADP pages are all encoded in UTF-8. If I can somehow change this assumption, my problems will be solved. Until then, AOLServer will continue to read every 8-bit character in my file as an illegal UTF-8 character, and thus barf on every Hebrew character in there.
Just to be sure this detail hasn't tripped you up: Tcl and the AOLserver Tcl API use different naming conventions for charsets.
So reading in your file on disk you'd use:
fconfigure $fd -encoding "iso8859-8"
Returning the page you'd use:
ns_return 200 "text/html; charset=iso-8859-8"
(note extra hyphen in second case).
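Putting both halves together, a sketch of reading and returning such a page (the file name is a placeholder, and the last line obviously only runs inside AOLserver):

```tcl
# Read an iso-8859-8 file into a (Unicode) Tcl string...
set fd [open "page.html" r]
fconfigure $fd -encoding iso8859-8   ;# Tcl's spelling: no second hyphen
set content [read $fd]
close $fd

# ...and declare the matching charset, in HTTP's spelling, on the way out.
ns_return 200 "text/html; charset=iso-8859-8" $content
```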
Anyway, maybe you want to check if aolserver has correctly transformed the characters internally to unicode from the file by trying an adp like this:
<% set s "some text with non-ASCII characters in it" %>
bytelength: <%=[string bytelength $s]%>

It should produce a higher value for bytelength than [string length] gives for the same string. At least it does for me when I insert some German umlauts into the string.
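The same check works in a bare tclsh, outside any ADP (the sample string is arbitrary):

```tcl
# One umlaut: 5 characters, but 6 bytes in Tcl's internal UTF-8 form.
# If the file had been read with the wrong encoding, the non-ASCII
# characters would already be "?" and the two numbers would match.
set s "J\u00E4ger"
puts [string length $s]       ;# 5
puts [string bytelength $s]   ;# 6
```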
Basically, if you take a file in windows-1255/iso-8859-8 and read it into a program expecting to see UTF-8, all of the Hebrew characters are turned into question marks. The Latin characters remain unchanged, as you might expect.

So I really do need some way of telling AOLServer to read in ADP and HTML files in an encoding other than UTF-8. I'll try the solutions posted here earlier, but if someone knows of a good, simple fix, I'd be happy to hear it!
There is some design documentation about handling this problem, although I am not entirely sure if it was ever implemented. You can find it in packages/acs-lang/www/doc/ in the distribution. In particular, there is a section in i18n-design.html, "VI.B Naming of Template Files To Encode Language and Character Set", which might have some useful information. Also be aware that Lars and his Collaboraid gang are currently doing a major overhaul of acs-lang, so even if this stuff did work, it might not on HEAD (then again, maybe it didn't work and now it will).
In particular, Henry gives an example of a modified rp_handle_adp_request proc, which reads in .adp files like this:

set mimetype [ns_guesstype [ad_conn file]]
set encoding [ns_encodingfortype $mimetype]
set fd [open [ad_conn file] r]
fconfigure $fd -encoding $encoding
set template [read $fd]
close $fd
As far as I remember, you had to patch the adp handler if your files were in anything except ISO Latin or maybe UTF-8 encoding. The AOLserver adp handler that reads files from disk has some kind of hardcoded expectation of what encoding the file is in. So instead, you need to patch ACS's adp handler to read the file into a Tcl string, using the channel encoding of your choice when you open the file, and then pass it to the adp interpreter with the -string option.

I had a similar hack for plain .tcl files: the request processor was patched to set the encoding before reading the file. These hacks let me read Japanese Shift-JIS-encoded .tcl and .adp files properly.
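A sketch of that patch idea (the proc name here is made up, and the exact ns_adp_parse options should be checked against your AOLserver version's docs):

```tcl
# Read the .adp with an explicit channel encoding, then evaluate the
# resulting Unicode string with -string instead of letting the ADP
# engine read the file (and assume an encoding) itself.
proc my_handle_adp_request {file enc} {
    set fd [open $file r]
    fconfigure $fd -encoding $enc   ;# e.g. shiftjis or iso8859-8
    set template [read $fd]
    close $fd
    ns_adp_parse -string $template
}
```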
Emacs can be coerced to leave file encodings alone, but it was
always trying to do something "helpful" if you turned your back on it.
> Emacs can be coerced to leave file encodings alone, but it was always trying to do something "helpful" if you turned your back on it.
Anyone know of a free tool that is good for reading/converting/writing file encodings?
I received a call from the graphic designer today. She said that everyone was asking (impatiently) when the site will be ready. I told her that the site is ready, but that I'm still working on getting OpenACS to handle windows-1255 encoding. I told her (once again) that if she's willing to work in Unicode, then I can simply run "recode" over the entire system, and it'll be ready within seconds.
She said that if it's not too hard to work in Unicode (which is a question of operating system, applications, and all sorts of other fun, for those who haven't ever had to deal with end users on this issue), then she's willing to try it. I gave her a link to the Windows version of GNU recode, transformed the site, and everything worked fine.
So thanks again to everyone here, but it seems that life is just easier in general when everyone works with Unicode -- and the time I spent investigating and playing with this dwarfs the time that she'll have to invest in learning "recode".
"Anyone know of a free tool that is good for reading/converting/writing file encodings?"
Does tclsh count?
Something along the lines of:
set input [open $input_file r]
set output [open $output_file w]
fconfigure $input -encoding $input_encoding
fconfigure $output -encoding $output_encoding
puts -nonewline $output [read $input]
close $input
close $output

I've never worked with encodings, but it seems like Tcl would make a great language to write a simple command-line utility in to do this...
But if you're simply looking for a good command-line utility, you should check out GNU recode. It knows how to translate between any two encodings.
Another Unicode encoding, known as UCS-2, uses two bytes for every character. The good news is that every character is a fixed width. The bad news is that it's not backward-compatible with ASCII.
However, an ASCII document converted into UCS-2 is still readable by humans: it begins with two marker bytes (the byte-order mark), followed by the UCS-2 characters. In big-endian order, each UCS-2 character consists of a null (zero-value) byte followed by its ASCII equivalent. So "ABC" becomes "marker, marker, null, A, null, B, null, C".
For what it's worth, Microsoft systems seem to have standardized on UCS-2. The open-source languages and systems that I use, such as Perl, Python, Tcl, and PostgreSQL, instead seem to have adopted UTF-8.
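You can see both layouts from tclsh (note that Tcl's "unicode" encoding is its native-endian two-byte form, without the marker bytes):

```tcl
# UTF-8 leaves pure ASCII byte-for-byte unchanged:
binary scan [encoding convertto utf-8 "ABC"] H* hex
puts $hex   ;# 414243

# The two-byte form doubles the size; byte order depends on the machine
# (41 00 42 00 43 00 on little-endian), and Tcl emits no marker bytes.
binary scan [encoding convertto unicode "ABC"] H* hex
puts $hex
```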