Forum OpenACS Development: cr_write_content and utf-8

Posted by Michael Totschnig on
Hello,

if you use cr_write_content with the -string switch to read a UTF-8 encoded file into a string, the UTF-8 content gets corrupted. The reason is that "fconfigure $fd -translation binary" is called before reading the file. This line was introduced by Malte, and the CVS comment reads:
"Support for relative file locations which might even reside on a windows drive and therefore start with d:\"

I do not understand how this is related, but I think cr_write_content should be fixed to be able to read UTF-8. This affects, for example, file-storage when it displays the content of a text file.

Michael

Posted by Stefan Sobernig on
Salut,


fconfigure $fd -translation binary

Well, I cannot comment on Malte's intentions; I guess this comment is not directly related to the fconfigure line, it is rather a by-product.

The problem is that, while binary translation as such is fine (anything else would be platform-bound with respect to EOFs), it has a side effect of its own:


[...] and sets the encoding to binary (which disables encoding filtering) [...]
(quoting the fconfigure man page: http://docs.activestate.com/activetcl/8.4/tcl/TclCmd/fconfigure.htm#M11)

setting "-translation binary" implies (if not set explicitly) "-encoding binary", which is certainly not intended and corrupts character sets that need to be interpreted with a certain binary format (e.g. utf8).

so setting encoding explicitly to the system's default should suffice (to cover non-utf8 environments):

"fconfigure $fd -translation binary -encoding [encoding system]"

//stefan

Posted by Michael Totschnig on
Thank you, Stefan, for the explanation. I tested your suggestion, and it correctly reads files encoded both in UTF-8 and in Latin-1 on my system, where the system encoding is UTF-8.
Can this be committed to CVS?

Posted by Stefan Sobernig on
Ok, thanks for reporting back ...

Can this be committed to CVS?

Yep, now that I know it works, I will commit it to HEAD. Whether this (from my point of view major) fix goes into a stable branch shall be decided by the OCT.

I will also check for other occurrences of the issue ...

//stefan

Posted by Don Baccus on
HEAD is the right place. We have no intention of releasing from the oacs-5-4 branch, nor will there be any merge to HEAD from that branch.

Posted by Brian Fenton on
Hi

I'm having a related problem, but I don't know if it's my setup or an issue with OpenACS. I have CR items stored on the file system, and when I use cr_write_content -string to write them to a different location on the file system, they appear to be getting corrupted. My proc is below. Note that I have played around with my own fconfigure line, adding the -encoding flag too, but the file still gets corrupted. Interestingly, I can get it to work by hacking cr_write_content to use just "-translation binary" (without the -encoding part).

To test this, upload a Word document or PDF to the content repository, then call the proc below in the developer shell (with the item_id of your uploaded file and the location where you want the new file), and then try to open the created file.

I'd be grateful if someone could try this and let me know if it's an OpenACS issue or something in my setup.

ad_proc -public pdf::write_file_to_file_system {
    {-item_id:required}
    {-filename:required}
} {
    Write the content of a CR item to $filename on the file system.

    @return 1 on success, 0 on failure.
} {
    # Initialise
    set return_val 1

    if { [catch {
        set file_data [cr_write_content -string -item_id $item_id]
        set file_handle [open $filename "w"]
        fconfigure $file_handle -translation binary
        puts -nonewline $file_handle $file_data
        close $file_handle
    } errmsg] } {
        ns_log Error "pdf::write_file_to_file_system item_id=$item_id filename=$filename : $errmsg"
        set return_val 0
    }

    return $return_val
}
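
For instance, a call from the developer shell might look like this (the item_id and target path are placeholders):

    pdf::write_file_to_file_system -item_id 12345 -filename /tmp/copy-of-upload.pdf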

Posted by Emmanuelle Raffenne on
Hi Brian,

I had a similar problem with cr_write_content. When content_type is "file" and -string is set, it does:

fconfigure $fd -translation binary -encoding [encoding system]

I checked the encoding of our server (Debian) in tclsh and using the devsup shell, and to my surprise I got different results: utf-8 in tclsh but iso-8859 in the devsup shell. However, on my laptop (Mac), both return utf-8. The difference between our server and my laptop, besides the OS, is that the server is running AOLserver 4.5 and my laptop 4.0.10.

We couldn't figure out why "encoding system" was set to iso-8859 on our server, so our workaround was to add "encoding system utf-8" to the config file.
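
For reference, AOLserver's config file is plain Tcl, so the workaround is literally this one line (where exactly it goes in the config file is up to you):

    # Force Tcl's process-wide system encoding, overriding whatever
    # the LC_*/LANG environment produced at startup.
    encoding system utf-8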

I hope it will help.

Posted by Emmanuelle Raffenne on
Correction:

Where it says 'When content_type is "file"', it should read 'When storage_type is "file"'.

Posted by Brian Fenton on
Thanks Emmanuelle! That looks useful - I'll take a look at our OS. Does my test case work on your system? I'd be very grateful if you could report back.

thanks
Brian

Posted by Gustaf Neumann on
Emmanuelle,

Are you sure you were running tclsh and AOLserver on your server with the same environment variables and linked against the same Tcl shared libs? On both Mac OS X and lenny/sid, with AOLserver 4.5.1 and 4.0, I always see utf-8 for [encoding system], in tclsh as well as in ds/shell.

Background: during initialization, Tcl determines the default system encoding from the LC_* or LANG environment variables. If nothing can be found, it uses TCL_DEFAULT_ENCODING, which is set depending on the OS; for example, under Mac OS X TCL_DEFAULT_ENCODING is utf-8. If configure can't determine anything, the final default system encoding is "iso8859-1". Later, Tcl's system encoding can be altered on the scripting layer via "encoding system ?XXX?" or from C via Tcl_SetSystemEncoding(). AOLserver 4.0.10/4.5.1 does not set it via Tcl or C; NaviServer has a config variable named "systemencoding" and sets the encoding in init.tcl (if nothing is specified, it defaults to utf-8).

Note that when you load a library file or a www/*.tcl script that sets the encoding via "encoding system ...", it is set for the whole server (all threads); the system encoding is a global variable in the Tcl implementation. The only OpenACS package that sets the system encoding is lors-central (most likely not a good idea).

It is a good idea to check the LANG variable in your startup script for AOLserver and, if in doubt, use something like LANG=en_US.UTF-8.
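
A quick sanity check you can paste into ds/shell (or tclsh) to compare the environment with what Tcl derived from it:

    # Show the LANG the process actually sees, and Tcl's resulting system encoding.
    if {[info exists ::env(LANG)]} {
        puts "LANG = $::env(LANG)"
    } else {
        puts "LANG is unset"
    }
    puts "encoding system = [encoding system]"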

Hope this helps and all the best
-gustaf neumann

Posted by Emmanuelle Raffenne on
Hi Gustaf,

Thanks for your answer.

I am not sure about the configuration of the server at installation time; I need to check with Héctor on that. From what I can see, LANG is set to es_ES.UTF-8 or en_US.UTF-8 for all the users involved (the aolserver user, etc.), the default being es_ES.UTF-8.

Regarding setting "encoding system" from inside OpenACS, I already grep'd the whole tree when we first noticed the difference, and indeed the only package that sets it is lors-central, but in our case 1. we don't use it, and 2. it sets it to utf-8 anyway.

Also, while trying to run Brian's test case on my Mac (so UTF-8 in all cases), I noticed that "fconfigure $channel -translation binary" would use iso8859-1 unless -encoding is set. I tested with a text file encoded in UTF-8; the new file's encoding was iso8859-1. Note that the content includes Spanish-specific characters like "ñ".

:-S

Posted by Gustaf Neumann on
Are you saying that the LANG of nsd is set to en_US.UTF-8 and the result of [encoding system] is "iso8859-1"?

Posted by Emmanuelle Raffenne on
Gustaf,

Yes, that's what I am saying :S.

Héctor and I just checked again, in case we were missing something, but got the same result. The user who runs AOLserver has LANG set to UTF-8 (we tried with both en_US.UTF-8 and es_ES.UTF-8, just in case), and we still get iso8859-1 when running "encoding system" in the dev-support Tcl shell, while we get "utf-8" from tclsh. Very strange.

Posted by Gustaf Neumann on
Just in case: there is a difference between logging in as a user and running a command as a user. If you execute the command

set ::env(LANG)

in ds/shell, do you get "en_US.UTF-8" as result? What Tcl version are you using on the server in question?

Posted by Emmanuelle Raffenne on
Hi Brian,

I've run your test case with Word and PDF files, and indeed the files get corrupted, and it's not encoding related IMO. For example, the original Word file was 140K and the copy was 154K. I didn't dig into it to understand why it happens.

As Dave said, I would suggest using "file copy" or "fcopy" for your purpose; much easier.

Posted by Brian Fenton on
Thanks, Emmanuelle, for trying the test case. Yes, I think I'll probably use file copy for the cases where we store files on the file system (I haven't tested yet whether the code works on systems where we store files in the database), but I thought I'd see if anyone here has a fix for the problem.

For completeness, here are the settings on my system:
set ::env(LANG) in ds/shell returns en_IE.UTF-8
encoding system in ds/shell returns utf-8
encoding system in tclsh returns utf-8
locale on the OS returns LANG=en_IE.UTF-8
echo $NLS_LANG on the OS returns .UTF8

thanks again
Brian

Posted by Dave Bauer on
Why don't you use file copy instead?

This seems quite overly complex to copy a CR file that is already stored in the filesystem.

I don't understand the issues with converting to a string. I suspect that if you read a non-text file into a string, you have to keep it as binary.
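
For the filesystem-backed case, a binary-safe copy in plain Tcl needs no encoding handling at all, since fcopy moves raw bytes once both channels are binary (a minimal sketch; the paths are placeholders):

    set in  [open /tmp/original.pdf r]
    set out [open /tmp/copy.pdf w]
    fconfigure $in  -translation binary    ;# raw bytes in, no encoding filtering
    fconfigure $out -translation binary    ;# raw bytes out
    fcopy $in $out
    close $in
    close $out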

Posted by Brian Fenton on
Thanks for the reply. I can't use file copy, as the files may be stored in the database, so I have to code in a way that covers both cases.