Forum OpenACS Development: Re: Ain’t No ☀shine

The short answer to your question, whether the tip version of NaviServer will work on that kind of data, is: test it! I have set up a test environment with Tcl 8.5, and yes, it works.

The whole situation with Tcl byte arrays is tricky (to use a polite word). I am pretty sure you do not want to know all the details, but since you brought it up, here goes. I have probably not got everything completely correct, and the details certainly change between Tcl versions.

A "bytearray" is an internal type of a Tcl_Obj that determines how the internal representation of a Tcl value (not necessarily a Tcl variable) is interpreted. There are situations where a Tcl byte array really represents binary data (e.g. the content of an image), situations where byte arrays are used for content that has a string representation in UTF-8 but does not fit into Tcl's internal UCS-2 representation (e.g. UTF-8 characters of more than 2 bytes), and situations where one has a subset of UTF-8 that fits into UCS-2. Depending on the situation, Tcl has internal machinery that tries to detect when it is safe to use a byte array directly, based on whether or not the byte array has a string rep (a byte array without a string rep is called a pure byte array).

Unfortunately, it is also possible to create a string rep in situations where there should be none, e.g. when one interactively enters a command returning a pure byte array, or when the Tcl C-API function Tcl_GetStringFromObj() is called on such a Tcl_Obj, e.g. in an ns_log or some other command used for debugging. In these cases it easily happens that a wrong representation is chosen. This is where convertto/convertfrom come into play: they are mostly there for converting to/from the UCS-2 rep and producing proper pure Tcl byte arrays.
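To make the difference between characters and bytes concrete, here is a minimal sketch in plain Tcl. It only uses `encoding convertto`, `binary scan` and `string length`; the helper name `utf8_hex` is made up for this illustration and is not part of Tcl or NaviServer. The characters are written as `\u` escapes so that the result does not depend on the encoding of the script file.

```tcl
# Hypothetical helper (not part of Tcl): hex dump of the UTF-8 bytes of a string.
proc utf8_hex {text} {
    # convertto turns a string of characters into a byte array of UTF-8 bytes
    set bytes [encoding convertto utf-8 $text]
    binary scan $bytes H* hex
    return $hex
}

# a = U+0061, u-umlaut = U+00FC, black sun = U+2600
foreach v [list a \u00fc \u2600] {
    set bytes [encoding convertto utf-8 $v]
    # string length counts characters for $v, but bytes for the byte array
    puts "chars [string length $v] bytes [string length $bytes] hex [utf8_hex $v]"
}
```

Each value is a single character, but the byte array produced by `convertto` has 1, 2 and 3 bytes respectively.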

For a Tcl application developer, it is not so easy to know when convertto is necessary. Some good Tcl guys are working to improve the situation.

For example, starting with Tcl 8.7a1, Tcl has two types of Tcl byte arrays: the classical one and a so-called "proper" bytearray, which replaces the former pure bytearray and is more robust against creating string reps. In addition to these, there were many changes to address various bug reports from that area.

Checking whether German umlauts work is not a good indicator of whether the encoding is right. Here is a pure Tcl example that shows the umlaut correctly (it takes 2 bytes in UTF-8) but damages the black sun. The code without the convertto operations works fine only for "a" (1 byte in UTF-8) and "ü" (2 bytes in UTF-8), but not for "☀" (3 bytes in UTF-8).

foreach v {"a" "ü" "☀"} {
   puts "v <$v> v1 <[binary decode base64 [binary encode base64 $v]]>"
}
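The naive round trip above can also be checked programmatically. The following is only a sketch, assuming Tcl 8.6 or newer (where `binary encode` and `binary decode` exist); the proc name `naive_roundtrip_status` is made up for this illustration. As far as I understand it, the binary commands operate on byte values, so a character with a code point above U+00FF cannot pass through unchanged. Depending on the Tcl version, the result is either a damaged value or an error, which is why the sketch handles both.

```tcl
# Hypothetical helper: classify the round trip WITHOUT convertto/convertfrom.
proc naive_roundtrip_status {v} {
    # Some Tcl versions raise an error here instead of returning damaged data
    if {[catch {binary decode base64 [binary encode base64 $v]} result]} {
        return "error"
    }
    return [expr {$result eq $v ? "ok" : "damaged"}]
}

foreach v [list a \u00fc \u2600] {
    # scan %c yields the code point of the (single) character
    puts "U+[format %04X [scan $v %c]] -> [naive_roundtrip_status $v]"
}
```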

With convert operations at the right places, it works fine.

foreach v {"a" "ü" "☀"} {
   puts "v <$v> v1 <[encoding convertfrom utf-8 [binary decode base64 [binary encode base64 [encoding convertto utf-8 $v]]]]>"
}
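If this pattern is needed in several places, the two conversions can be wrapped once so that callers cannot forget them. This is a minimal sketch, assuming Tcl 8.6 or newer and that the text should travel as UTF-8; the proc names `text_to_base64` and `base64_to_text` are made up for this illustration and do not exist in Tcl or NaviServer.

```tcl
# Hypothetical helper: characters -> UTF-8 bytes -> base64 text
proc text_to_base64 {text} {
    return [binary encode base64 [encoding convertto utf-8 $text]]
}

# Hypothetical helper: base64 text -> UTF-8 bytes -> characters
proc base64_to_text {b64} {
    return [encoding convertfrom utf-8 [binary decode base64 $b64]]
}

foreach v [list a \u00fc \u2600] {
    set roundtrip [base64_to_text [text_to_base64 $v]]
    puts "v <$v> roundtrip <$roundtrip> equal [expr {$v eq $roundtrip}]"
}
```

These helpers are only appropriate for text. Real binary data (e.g. image content) is already a byte array and should go to `binary encode base64` directly, without `convertto`.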

Nobody is happy about this; people are working to make it better in Tcl 9.

Hope this helps
-gn