Forum OpenACS Q&A: little problem with nsd8x

Posted by David Kuczek on
I had been running AOLserver with a symbolic link in /bin pointing to
nsd76 until the day before yesterday, when I saw a thread on the
OpenACS 4.0 Design forum saying that you should use nsd8x instead.

(I couldn't post that link as a link again - it cut everything after
the "?") https://openacs.org/bboard/q-and-a-fetch-msg.tcl?msg_id=0001Hp&topic_id=12&topic=OpenACS%204%2e0%20Design

So I tried nsd8x and got a weird output for special German characters
like "ä, ö", which I didn't get using nsd76. The output
looked like "A&x" or something like that.

The important thing is that I was too lazy to write an "&auml;"
for every "ä" in the first place when I translated the
OpenACS texts into German.

As it is better to use the HTML standard than to be lazy, I want to
change all the ä, ö, ü etc. in my documents into their HTML entities. Is
there a command in Linux similar to "grep" that can search all the
documents in my tree for those characters and change them all in one
sweep? That would be very cool.

Thanks

Posted by Henry Minsky on
Actually that's a little can of worms you just found. Tcl 7 did not use Unicode internally,
so it had the feature/bug of preserving 8 bit characters sometimes.

Tcl 8 uses Unicode strings internally, and thus needs to be told what encoding
input and output are in, so it can convert properly to Unicode. This allows Japanese
and other non-ASCII encodings to be handled properly.

However, the AOLserver maintainers did not properly fix AOLserver to use Unicode,
so ArsDigita ended up having to patch AOLserver. You need to use the latest
ArsDigita release (+ad12). I think it will pretty much default to ISO-8859-1 encoding,
unless Rob set it to use UTF8 (I thought I saw something about this).

You should look at my patches at http://www.ai.mit.edu/people/hqm/openacs and http://imode.arsdigita.com/i18n
for some advice on how to deal with character sets. But setting
URLCharset and OutputCharset to ISO-8859-1 in your .tcl init file should mostly work.
The issue is how .tcl and .adp files are sourced from disk. It used to be that
.tcl files would get interpreted using the default Tcl system encoding, which was
usually iso-8859-1, but .adp files were read as raw UTF8. You need to patch
things a little to get consistent behavior.
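
Before patching anything, it can help to find out which encoding the files on disk are actually in. Here is a rough sketch, assuming iconv is installed; the function names is_utf8 and list_non_utf8 are made up for illustration. A file with Latin-1 umlauts fails UTF-8 validation, while pure ASCII passes either way:

```shell
#!/bin/sh
# is_utf8 FILE
# Succeeds (exit 0) if FILE decodes cleanly as UTF-8, fails otherwise.
# Pure ASCII files also pass, since ASCII is a subset of UTF-8.
is_utf8() {
  iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# list_non_utf8 DIR PATTERN
# Print every file under DIR whose name matches PATTERN and that is not
# valid UTF-8, i.e. the files most likely saved as ISO-8859-1.
list_non_utf8() {
  find "$1" -type f -name "$2" | while IFS= read -r f; do
    is_utf8 "$f" || printf '%s\n' "$f"
  done
}
```

You would call it as something like: list_non_utf8 /web/mydir '*.adp'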

Although Rob put the needed hooks
into AOLserver, the toolkit developers at ArsDigita didn't make it a real priority
to get charset encoding integrated properly into the ACS toolkit releases, since everyone speaks English,
right? At least everyone with money in their pockets, and it was hard for
the developers to test.

My hope is that when the OpenACS 4 release is out, we can integrate and document how to control charset
encoding in a nice consistent manner, and we can integrate that with the acs-lang module, which
handles message catalogs for translation, and some other goodies like Tcl routines for internationalized time and
date formatting.

Posted by David Kuczek on
Hello Henry,

we only have 4 characters in the German language that are different. They are ä ö ü ß.

When I change the URLCharset and OutputCharset to ISO-8859-1, will those characters be displayed correctly even in an English browser?

What about a grep-like command to change certain things in every document? Or wouldn't you recommend doing that, if it is possible at all? There should be a command that scans all the .tcl documents and lets you confirm or skip each match.
Posted by Henry Minsky on
ISO-8859-1 is supposed to handle English, German, French, Spanish, and
maybe some others as well. So you should be able to
author your files in ISO-8859-1 without any special HTML escapes.

If your .adp files are in ISO-8859-1, though, you may need my
patch which reads the .adp script into Tcl using the
desired encoding. If you just let the system call ns_adp_parse
on a file, I believe it will try to read it in as UTF8. You can
convert your .adp file to UTF8 if you want, but that is a pain.
It is better to use a patch which reads the .adp file
into a Tcl string and then calls ns_adp_parse with the -string option.
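
If you do go the conversion route instead, a minimal sketch might look like this. It assumes iconv is available and that the file really is ISO-8859-1 (running it twice on the same file would double-encode the umlauts); latin1_to_utf8 and the .latin1 backup suffix are made-up names.

```shell
#!/bin/sh
# latin1_to_utf8 FILE
# Rewrite FILE in place from ISO-8859-1 to UTF-8, keeping the original
# bytes as FILE.latin1 so the change can be undone.
latin1_to_utf8() {
  cp -p "$1" "$1.latin1" || return 1
  # Convert from the backup into the original name; restore on failure.
  if ! iconv -f ISO-8859-1 -t UTF-8 "$1.latin1" > "$1"; then
    cp -p "$1.latin1" "$1"
    return 1
  fi
}
```

Keeping the backup next to the file makes it easy to compare the two versions before deleting anything.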

Posted by Andrew Piskorski on
David, someone has probably written a Perl script to do just that.

But I've done a lot more shell hacking than Perl. I've never needed to traverse a whole directory tree, but here's a really simple Bourne shell script to replace "foo" with "bar" and "xx" with "yy", for files in a single directory only. It just dumps the new versions of the files into the tmp/ subdir. You'd run it with: do-files.sh *.tcl or the like.

#!/bin/sh
# do-files.sh

mkdir -p tmp
# With no "in" list, "for x" loops over the script's arguments.
for x
do
  sed -e 's%foo%bar%g' -e 's%xx%yy%g' "$x" > "tmp/$x"
done
Hm, you could probably cobble together something ugly like:
find /web/mydir -name "*.tcl" -exec my-script.sh {} ;
where my-script.sh is:
#!/bin/sh
# my-script.sh

# $1 is the path that find passes in; mirror its directory under /tmp.
dir_tree=$(dirname "$1")
tmp_tree="/tmp${dir_tree}"

if [ ! -d "$tmp_tree" ] ; then
  mkdir -p "$tmp_tree"
fi

sed -e 's%foo%bar%g' -e 's%xx%yy%g' "$1" > "$tmp_tree/$(basename "$1")"
Of course, I haven't tested that at all.
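
For the umlaut case specifically, something along these lines might do it in one sweep. This is a sketch under a few assumptions: the files are ISO-8859-1 (one byte per umlaut), sed runs in the C locale so it works on raw bytes, and the function names and the .bak suffix are only illustrative. The & in each replacement has to be escaped, because a bare & in a sed replacement means "the matched text".

```shell
#!/bin/sh
# umlauts_to_entities: filter that reads ISO-8859-1 text on stdin and
# writes it to stdout with German special characters as HTML entities.
umlauts_to_entities() {
  # The Latin-1 bytes are built with printf octal escapes so that this
  # script itself stays pure ASCII.
  LC_ALL=C sed \
    -e "s/$(printf '\344')/\\&auml;/g" \
    -e "s/$(printf '\366')/\\&ouml;/g" \
    -e "s/$(printf '\374')/\\&uuml;/g" \
    -e "s/$(printf '\304')/\\&Auml;/g" \
    -e "s/$(printf '\326')/\\&Ouml;/g" \
    -e "s/$(printf '\334')/\\&Uuml;/g" \
    -e "s/$(printf '\337')/\\&szlig;/g"
}

# convert_tree DIR PATTERN
# Apply the filter to every matching file under DIR, keeping the
# original next to it as FILE.bak.
convert_tree() {
  find "$1" -type f -name "$2" | while IFS= read -r f; do
    cp -p "$f" "$f.bak" &&
      umlauts_to_entities < "$f.bak" > "$f"
  done
}
```

You would run it as something like: convert_tree /web/mydir '*.tcl' and then diff a few files against their .bak copies before trusting the result.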
Posted by Andrew Piskorski on
test
Posted by Andrew Piskorski on
That's odd, BBoard ate the backslash ('\') before the semicolon (';') in my find command, above. Here it is again: find /web/mydir -name "*.tcl" -exec my-script.sh {} ;
Posted by Andrew Piskorski on
Damn, even if I run all my text through ns_quotehtml first, BBoard
still swallows up the backslash before the semicolon!

Or, if I don't run it through ns_quotehtml, posting to this thread
then fails on:

  .../bboard/insert-msg.tcl

with an error message of:

  Database operation "dml" failed 0001K9

Posted by Andrew Piskorski on
\
Posted by Andrew Piskorski on
Mm, a single backslash with nothing else in the form text box causes
the 'Database operation "dml" failed 0001K9' error.  Two backslashes
end up as a single backslash.  Sounds like a problem with Tcl escape
sequences....
Posted by Don Baccus on
It's Postgres, actually, which thinks it is C even though it's just an RDBMS that happens to be written in C.

The PG folks want to be SQL92 compliant but unfortunately there's such a large user base that the chances of them changing this are about as high as the chances of Oracle deciding that '' isn't the same as NULL.

We should be double-backslashing quoted strings ... hmmm ... I might be able to do that in the driver.

For real grins see what happens if you stick backslash-zero-zero-zero in your post!
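
The doubling itself is simple enough to show as a filter. This is only a sketch, not the actual driver fix: double_backslashes is a hypothetical name, and it assumes the backend treats a backslash inside a quoted string as an escape character, as described above.

```shell
#!/bin/sh
# double_backslashes: read text on stdin and write it to stdout with
# every backslash doubled, so that a backend which treats backslash as
# an escape character ends up storing the original single backslash.
double_backslashes() {
  # In the sed script, \\ matches one literal backslash and \\\\ in the
  # replacement emits two.
  sed -e 's/\\/\\\\/g'
}
```

Text without backslashes passes through unchanged, so it should be safe to apply to every quoted string.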

Posted by Brian Mann on
Posted by David Kuczek on
I return to my original question about the charset.

My browser still outputs ö etc. when I set the symbolic link in aolserver/bin to nsd8x...

Running nsd76 didn't bother me until today, but now I would like to use Tcl 8.x for some regular expressions...

I set the following in /home/aolserver/nsd.tcl:

ns_section "ns/parameters"
        ns_param  home        $homedir
        ns_param  debug        false
        ns_param  MailHost    localhost
        ns_param  ServerLog    ${homedir}/log/server.log
        ns_param  LogRoll      on
        ns_param  HackContentType 1
        ns_param  URLCharset      iso-8859-1
        ns_param  OutputCharset  iso-8859-1
        ns_param  HttpOpenCharset iso-8859-1

ns_section "ns/mimetypes"
        ns_param  default        "*/*"    ;# MIME type for unknown extension
        ns_param  noextension    "*/*"    ;# MIME type for missing extension
        #ns_param  ".xls"        "application/vnd.ms-excel"
        ns_param .html "text/html; charset=iso-8859-1"
        ns_param .tcl "text/html; charset=iso-8859-1"
        ns_param .adp "text/html; charset=iso-8859-1"


Any other suggestions?