Forum OpenACS Q&A: Getting started: adding fields to registration screen
- I need to add three mandatory fields to the user information: two text fields (first and family name in Russian spelling) and an item picked from a list (user location). The items should be asked for at registration and should be editable in the user's workspace.
- How do I add them? What files do I need to edit, and how do I add the columns to the users table?
- I need to change the default encoding of all HTML pages to
Windows-1251.
- The stuff
  text/html; charset=windows-1251
  needs to be listed both in the server response and in the HEAD section of the HTML source of each and every page; what should I modify to achieve this?
- Are there any problems with PostgreSQL if I feed it characters in the 128-255 code range (in fields, not in column names)?
- If I modify the files that handle login, the workspace, etc. in the core, does that mean I won't be able to upgrade to a new release, or is there a way around it?
You can create a separate table for this and add the SQL for the table creation wherever it makes the most sense for you. Key that table on object_id. Or you could modify the default object creation code and add these as attributes of the user object. If you don't think you will ever have to add other attributes to the object on a production server after you deploy, that would be the way to go.
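A minimal sketch of the separate-table approach, assuming an OpenACS 4.x-style install with the db_* Tcl API; the table name users_extra_info and its columns are made up for illustration:

# Hypothetical adjunct table, keyed on the user's object_id, created in your
# own data-model .sql file, e.g.:
#   create table users_extra_info (
#       user_id        integer primary key references users(user_id),
#       first_name_ru  varchar(100) not null,
#       family_name_ru varchar(100) not null,
#       location_id    integer not null
#   );
#
# In the Tcl page that processes the registration or workspace form:
db_dml insert_extra_info {
    insert into users_extra_info
        (user_id, first_name_ru, family_name_ru, location_id)
    values
        (:user_id, :first_name_ru, :family_name_ru, :location_id)
}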
text/html; charset=windows-1251 needs to be listed both in the server response and in the HEAD section
For the head section, modify the default-master template. I'm not sure about the server response.
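For the server-response half, the encodings document quoted below suggests this kind of AOLserver config (nsd.tcl format); treat it as a sketch, and check that your AOLserver/Tcl build actually knows the windows-1251 charset:

ns_section ns/parameters
ns_param OutputCharset   windows-1251   ;# charset used when a text/* response gives none
ns_param HackContentType true           ;# append that charset to the Content-Type header

ns_section ns/mimetypes
ns_param .html "text/html; charset=windows-1251"
ns_param .txt  "text/plain; charset=windows-1251"

The meta tag for the HEAD section still has to come from your templates, e.g. the default-master, as noted above.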
Are there any problems with PostgreSQL if I feed it characters in the 128-255 code range
Make sure your PostgreSQL installation was created with the right charset; I don't know the available charset names offhand. Or are they all 8-bit nowadays?
Is it an OK workflow to modify the files on the development server, debug them there, and then copy everything to the production one? If I'm developing alone, can I get away with not using CVS for some time?
I recommend going with CVS right from the beginning. You can tag your files when you push to production so if you find something that worked on development but is hosing production you can roll back.
"If I modify the files that handle login, the workspace, etc. in the core": you always run that risk. Read these:
- https://openacs.org/bboard/q-and-a-fetch-msg.tcl?msg_id=00036T&topic_id=11&topic=OpenACS
- http://www.cvshome.org/docs/manual/cvs_13.html#SEC104
- https://openacs.org/new-file-storage/download/cvs.html?version_id=140
- http://www.piskorski.com/cvs-conventions.html
See also encodings.html, which comes stock with the AOLserver distro, normally in the root dir. Not sure if it will all fit below, but heck -- worth a try...
Character Encoding in AOLserver 3.0 and ACS
by Rob Mayoff
Note that this document applies only to the Tcl 8 version of AOLserver, also known as nsd8x, because Tcl 7 has no internationalization support. This document is also mainly concerned with the AOLserver Tcl API, because that is what we use at ArsDigita. There are probably problems in the C API as well that are not covered here.
Contents
- The Problem
- Terminology
- Database Access
- Configuration Files
- Tcl Files
- Output from Tcl
- Content Files (not Tcl or ADP Scripts)
- ADP scripts
- ADP - NsTclIncludeCmd
- URL Encoding
- URL Path
- Form Data in application/x-www-form-urlencoded Format
- Form Data in multipart/form-data Format
- Cookies
- ns_httpopen / ns_httpget
- References
The Problem
Here's a simple example of the problem: you have a file on disk, named "hello.html" and stored using the ISO-8859-1 encoding, containing a greeting to "Günther" ("u" with an umlaut). Since the file is in ISO-8859-1 encoding, the u with umlaut is stored as one byte with value xFC. Suppose you send this file to the user using this script:
set fd [open /web/pages/hello.html r]
set content [read $fd [file size /web/pages/hello.html]]
close $fd
ns_return 200 text/html $content
Then the user will probably see "Günther" with the umlaut intact. But suppose you send this file using this script:
set fd [open /web/pages/hello.html r]
set content [read $fd [file size /web/pages/hello.html]]
close $fd
regsub {Hello.} $content {Hello!} content
ns_return 200 text/html $content
Then the user will probably see "GÃ¼nther": an "A" with a tilde and a "1/4" fraction where the "ü" should be. What happened? The reason it worked in the first case is that, by default, AOLserver just ships out the raw bytes from the (ISO-8859-1-encoded) file, and the HTTP standard says that the client must assume a charset of ISO-8859-1 if no other charset is specified. The file encoding and the browser encoding matched, and AOLserver sent the data unmodified, so everything worked.
The second case is different. It turns out that Tcl 8.1 and later use Unicode. The interpreter normally stores strings using the UTF-8 encoding (which uses a variable number of bytes per character), and sometimes converts them to UCS-2 encoding (which uses 16-bit "wide characters"). The regsub command is one of those cases where conversions are involved. First, regsub converted the string to UCS-2. Tcl's UTF-8 parser is lenient, so the transformation ended up translating xFC into x00FC. (This happens to be the correct translation because UCS-2 is a superset of ISO-8859-1.) Then regsub did its matching and substitution. Then it converted the UCS-2 representation back to UTF-8. The UTF-8 encoding of x00FC is xC3 xBC. AOLserver does not know anything about UTF-8; it just sends whatever bytes you give it. In ISO-8859-1, xC3 means Ã and xBC means ¼.
So regsub didn't do anything wrong. We gave it garbage (a non-UTF-8 string), so it gave us garbage. How do we solve this problem? We need to make sure that all of AOLserver's textual input is translated to its UTF-8 representation and that the UTF-8 is translated to the appropriate character encoding on output.
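To make the input side concrete, here is a sketch of the broken example above with the file read through Tcl's encoding layer, so the string regsub sees is valid UTF-8; the fconfigure and explicit-charset ns_return calls follow the patterns shown later in this document:

set fd [open /web/pages/hello.html r]
# tell Tcl the on-disk encoding, so the data is converted to Tcl's
# internal UTF-8 representation as it is read
fconfigure $fd -encoding iso8859-1
set content [read $fd [file size /web/pages/hello.html]]
close $fd
regsub {Hello.} $content {Hello!} content
# the charset parameter tells AOLserver what to translate the UTF-8
# string to on output, and tells the browser what it is receiving
ns_return 200 "text/html; charset=iso-8859-1" $content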
Terminology
A character encoding is a mapping from a set of characters to a set of octet sequences. US-ASCII maps all of its characters to a single octet each. UTF-8 maps its characters to a variable number of octets each. "Charset" is synonymous with "character encoding"; Internet standards use this term.
Tcl 8.1 and later use Unicode and UTF-8 internally and include support for converting between character encodings. The Tcl names for various encodings are different than the Internet standard names. So, in this document, I typically use the term "encoding" when I am referring to Tcl, and "charset" when I am referring to an Internet protocol feature.
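As a small illustration (not part of the original document): the Internet charset iso-8859-1 is called iso8859-1 in Tcl, and you can see which encodings your interpreter supports with the standard encoding command:

# list the encoding names this Tcl interpreter knows
puts [encoding names]
# convert a Tcl string (internally UTF-8) to the octets of a given encoding
set octets [encoding convertto iso8859-1 "G\u00FCnther"]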
Database Access
For database access, the only sane choice is to use a database that supports UTF-8. Then Tcl strings can be passed to and from the database client library unmodified. Trust me, you just want to use a UTF-8 database.
Configuration Files
AOLserver reads its configuration files (both Tcl and ini-format) with no character encoding translation. This means that you must store AOLserver configuration files in UTF-8.
Tcl Files
AOLserver supports Tcl source files in your Tcl library and under your PageRoot. In either case, it reads the files using the Tcl "source" command, which uses the Tcl "system encoding" when it reads the files. In AOLserver, the system encoding is UTF-8. Therefore you must store your Tcl source in UTF-8 format. The simplest strategy (if you do not have a UTF-8 editor) is to use only US-ASCII bytes, and represent any other characters using the \xXX notation (for any ISO-8859-1 character) or the \uXXXX notation (for any Unicode character).
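For instance, a US-ASCII-only source line that still yields non-ASCII text (my example, not from the original):

# "G\u00FCnther" is plain ASCII in the source file, but the string Tcl
# builds contains the character U+00FC (u with umlaut)
set greeting "Hello, G\u00FCnther."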
Content Files
By "content file", I mean a file containing data to be sent to the client, not a file containing a program to be run. So an HTML or JPEG file is a content file, but a Tcl script is not.
AOLserver has several APIs for sending the contents of a file directly to the client. All of them send the contents of the file back to the client unmodified: no character encoding translation is performed. This means that it is up to you to ensure that the file's encoding is the same as the encoding the client expects.
The safest thing is to use only US-ASCII bytes in your text files - bytes with the high bit clear. Just about every character encoding you're likely to run across on the Web will be a superset of US-ASCII, so no matter what charset the client is expecting, your content will probably be displayed correctly. If you are sending an HTML (or XML) file, it can still access any Unicode character using the &#nnn; notation. However, if you have non-HTML files, or you don't want to deal with all those character reference entities, you'll have to make sure your client knows what character set you're sending.
The client knows what character set to expect from the Content-Type header. You're probably used to seeing a header like this:
Content-Type: text/html; charset=iso-8859-1
Typically, you determine the content-type to send for a file by calling ns_guesstype on it. ns_guesstype looks up the file extension in AOLserver's file extension table to pick the content-type. The default table is in the AOLserver manual. Some of the default mappings are:
Extension | Type |
---|---|
.html | text/html |
.txt | text/plain |
.jpg | image/jpeg |
You can include a charset parameter in the mapped content-type by setting it in the config file. In nsd.ini format:

[ns/mimetypes]
.html=text/html; charset=iso-8859-1
.txt=text/plain; charset=iso-8859-1

In nsd.tcl format:

ns_section ns/mimetypes
ns_param .html "text/html; charset=iso-8859-1"
ns_param .txt "text/plain; charset=iso-8859-1"
You can also map additional extensions for files stored in other charsets. In nsd.ini format:

[ns/mimetypes]
.html=text/html; charset=iso-8859-1
.txt=text/plain; charset=iso-8859-1
.html_sj=text/html; charset=shift_jis
.txt_sj=text/plain; charset=shift_jis
.html_ej=text/html; charset=euc-jp
.txt_ej=text/plain; charset=euc-jp

In nsd.tcl format:

ns_section ns/mimetypes
ns_param .html "text/html; charset=iso-8859-1"
ns_param .txt "text/plain; charset=iso-8859-1"
ns_param .html_sj "text/html; charset=shift_jis"
ns_param .txt_sj "text/plain; charset=shift_jis"
ns_param .html_ej "text/html; charset=euc-jp"
ns_param .txt_ej "text/plain; charset=euc-jp"
If a file is stored in one charset but needs to be sent in another, read it through Tcl's encoding layer and let AOLserver translate the string on output:

set fd [open somefile.html_sj r]
fconfigure $fd -encoding shiftjis
set html [read $fd [file size somefile.html_sj]]
close $fd
ns_return 200 "text/html; charset=euc-jp" $html
XXX ACS: ad_serve_html_file
Output from Tcl
Your Tcl programs (Tcl files, filters, and registered procs) can send content to the client using a number of commands:
- ns_writefp
- ns_connsendfp
- ns_returnfp
- ns_respond
- ns_returnfile
- ns_return (and variants like ns_returnerror)
- ns_write
Tcl stores strings in memory using UTF-8. However, when you send content to the client from Tcl, you may not want the client to receive UTF-8; he may not support it. So AOLserver can translate UTF-8 to a different charset.
If you use ns_return or ns_respond to send a Tcl string to the client, AOLserver determines what character set to use by examining the content type you specify:
- If your content-type includes a charset parameter, then AOLserver translates the string to that charset.
- Otherwise, if your content-type is text/anything, then AOLserver translates the string to the charset specified in the config file by ns/parameters/OutputCharset (iso-8859-1 by default).
- Otherwise, AOLserver sends the string unmodified.
In the second instance, where AOLserver uses ns/parameters/OutputCharset, if ns/parameters/HackContentType is also set to true, then AOLserver will modify the Content-Type header to include the charset parameter. HackContentType is set by default, and I strongly recommend leaving it set, because it's always safer to tell the client explicitly what charset you are sending.
For example, the default configuration is equivalent to this:
[ns/parameters]
OutputCharset=iso-8859-1
HackContentType=true
With this configuration, strings you send with, say, ns_return 200 text/html $html will be converted to the ISO-8859-1 encoding as they are sent to the client.
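If a particular response should go out in some other charset regardless of OutputCharset, the first rule above applies; a minimal sketch:

# the explicit charset parameter overrides OutputCharset for this response,
# and is what the browser will see in the Content-Type header
ns_return 200 "text/html; charset=utf-8" $html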
If you write the headers to the client with ns_write instead of letting AOLserver do it (via ns_return or ns_respond), then AOLserver does not parse the content-type. You must explicitly tell it what charset to use immediately after you write the headers, by calling ns_startcontent in one of these forms:
- ns_startcontent
  Tells AOLserver that you have written the headers and do not wish the content to be translated.
- ns_startcontent -charset charset
  Tells AOLserver that you have written the headers and wish the following content to be translated to the specified charset.
- ns_startcontent -type content-type
  Tells AOLserver that you have written the headers and wish the following content to be translated to the charset specified by content-type, which should be the same value you sent to the client in the Content-Type header. If content-type does not contain a charset parameter, AOLserver translates to ISO-8859-1.
The ns_choosecharset command will return the best charset to use, taking into account the Accept-Charset header and the charsets supported by AOLserver. The syntax is:

ns_choosecharset ?-preference charset-list?

The ns_choosecharset algorithm:
- Set preferred-charsets to the list of charsets specified by the -preference flag. If that flag was not given, use the config parameter ns/parameters/PreferredCharsets. If the config parameter is missing, use {utf-8 iso-8859-1}. The list order is significant.
- Set acceptable-charsets to the intersection of the Accept-Charset charsets and the charsets supported by AOLserver.
- If acceptable-charsets is empty, return the charset specified by config parameter ns/parameters/DefaultCharset, or iso-8859-1 by default.
- Choose the first charset from preferred-charsets that also appears in acceptable-charsets. Return that charset.
- If no charset in preferred-charsets also appears in acceptable-charsets, then choose the first charset listed in Accept-Charsets that also appears in acceptable-charsets. Return that charset.
(Note: the last step will always return a charset because acceptable-charsets can only contain charsets listed by Accept-Charsets.)
Example:
# Assume japanesetext.html_sj is stored in Shift-JIS encoding.
set fd [open japanesetext.html_sj r]
fconfigure $fd -encoding shiftjis
set html [read $fd [file size japanesetext.html_sj]]
close $fd
set charset [ns_choosecharset -preference {utf-8 shift-jis euc-jp iso-2022-jp}]
set type "text/html; charset=$charset"
ns_write "HTTP/1.0 200 OK
Content-Type: $type

"
ns_startcontent -type $type
ns_write $html
URL Encoding
Whether a URL is made up of "characters" or "bytes" is a complex issue (see RFC 2396 for details). Ultimately, though, URIs are transmitted over the network, so they must be reduced to bytes. However, HTTP limits the set of bytes used to transmit a URL. URLs containing bytes outside that set must be encoded for transmission. In URL encoding, one byte may be encoded as three bytes, which in US-ASCII represent a percent character ("%") followed by two hexadecimal digits.
After a URL is decoded, any bytes less than x80 represent US-ASCII characters. The problem with URLs and URL encoding is that historically, no standard defined what bytes larger than x80 represent. Various proposals, such as the IURI Internet-Draft, propose using UTF-8 exclusively as the character encoding in URLs, but existing software does not work that way.
AOLserver's ns_urlencode and ns_urldecode choose the character encoding to use in one of three ways:
- If the command was invoked with a -charset flag, use that charset. For example:

  ns_urlencode -charset shift_jis "\u304b"

  Unicode character U+304B is HIRAGANA LETTER KA. In Shift-JIS this is encoded as x82 xA9, so the command returns the string "%82%A9".
- If no -charset flag was given, then the ns_urlcharset command determines what encoding is used. The ns_urlcharset command sets the default charset for the ns_urlencode and ns_urldecode commands for one connection. For example, these commands have the same result as the preceding example:

  ns_urlcharset shiftjis
  ns_urlencode "\u304b"

  The ns_urlcharset command is only valid when called from a connection thread. Do not call it from an ns_schedule_proc thread.
- If neither of the preceding steps specified a charset, then the AOLserver config parameter ns/parameters/URLCharset determines the charset. The default value for the parameter is "iso-8859-1".
A URL, as seen by AOLserver in an HTTP request, consists of two parts, the path and the query. For example, in

/register/user-new.tcl?first_names=Rob&last_name=Mayoff

the path is "/register/user-new.tcl" and the query is "first_names=Rob&last_name=Mayoff". We will consider the path part and the query part separately.
URL Path
AOLserver decodes the path part of the URL in the HTTP request before determining how to handle the URL. It does not run any Tcl code in the connection thread first, so AOLserver always uses the charset specified by ns/parameters/URLCharset to decode the path.
You must use the same charset to encode URLs you send out,
or you will have problems.
However, other people might link to you from their servers and might be careless about the character encodings. So the safest practice is to use only US-ASCII characters in your URL paths if you possibly can.
Form Data in application/x-www-form-urlencoded Format
Form data comes from one of two places:
- In an HTTP GET request, the query data is the part of the request URL following the first x3F byte (the first question mark). This data is always in application/x-www-form-urlencoded format. (Okay, it could be raw data from an <ISINDEX> page, but that tag is deprecated in HTML 4.0. Let's simplify our lives by pretending it doesn't exist.)
- In an HTTP POST request, the query data is the request contents, following the request header. By default this data is in application/x-www-form-urlencoded format. The other format is covered under Form Data in multipart/form-data Format.
If you always send data in a single charset, and you always specify the charset in the Content-Type header, then it is safe to assume that form data is always encoded using that charset. Just make that your ns/parameters/URLCharset and don't worry about it.
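For example, a sketch of that config in nsd.tcl format; the charset value is whichever single charset you standardize on (iso-8859-1 here is only a placeholder):

ns_section ns/parameters
ns_param URLCharset    iso-8859-1   ;# charset used to decode URLs and form data
ns_param OutputCharset iso-8859-1   ;# charset used when sending Tcl strings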
If you cannot limit yourself to a single charset, then you need to use some other technique. No matter how you do it, you must call ns_urlcharset before calling ns_conn form or ns_getform.
If you call ns_urlcharset after you've asked AOLserver for the form, it will not work retroactively. Here are two ways you could determine the charset:
- Include a hidden field in all your forms, to indicate the charset. Example:

  # myform.tcl
  set _charset [ns_choosecharset]
  ns_return 200 "text/html; charset=$_charset" "
  <form action='myform-2.tcl'>
  <input type='hidden' name='_charset' value='$_charset'>
  First Names: <input type='text' name='first_names'><br>
  Last Name: <input type='text' name='last_name'><br>
  <input type='submit' name='submit' value='Submit'>
  </form>
  "

  The chicken-and-egg problem here is that you need the contents of a form field in order to decode the form. Fortunately, all charset names use only US-ASCII characters, so you can extract the _charset field from the query string without decoding it. The predefined command ns_formfieldcharset will do this for you:

  # myform-2.tcl
  ns_formfieldcharset _charset
  set form [ns_conn form]
  set first_names [ns_set get $form first_names]
  set last_name [ns_set get $form last_name]
  etc.

  ns_formfieldcharset calls ns_urlcharset, so this will affect all further use of ns_urlencode and ns_urldecode for that connection, unless you call ns_urlcharset again.

- Use a cookie to store the last charset you sent to the user. Example:

  # anotherform.tcl
  set _charset [ns_choosecharset]
  ns_set put [ns_conn outputheaders] Set-Cookie _charset=$_charset
  ns_return 200 "text/html; charset=$_charset" "
  <form action='anotherform-2.tcl'>
  First Names: <input type='text' name='first_names'><br>
  Last Name: <input type='text' name='last_name'><br>
  <input type='submit' name='submit' value='Submit'>
  </form>
  "

  There is no chicken-and-egg problem here, but AOLserver still provides the predefined command ns_cookiecharset to set the URL encoding from a cookie:

  # anotherform-2.tcl
  ns_cookiecharset _charset
  set form [ns_conn form]
  set first_names [ns_set get $form first_names]
  set last_name [ns_set get $form last_name]
  etc.

  Using a cookie has the big drawback that a cookie is not associated with a single web page. So if the user uses his back button, or has a page cached, or has multiple windows open, the wrong cookie value might be sent back to us.
Form Data in multipart/form-data Format
The browser sends data in multipart/form-data format when the FORM tag says enctype='multipart/form-data'. This format is based on the MIME standard and allows file upload (which application/x-www-form-urlencoded does not).
Alas, multipart/form-data format is no better than application/x-www-form-urlencoded format as far as character encoding issues are concerned. The MIME multipart format allows each form field to include its own Content-Type header with a charset parameter, but in practice clients do not send any indication of the charset used. So we must resort to the same tricks to decide what charset the data is in: always use the same charset, or use a hidden field or a cookie to determine the charset.
The ns_formfieldcharset and ns_cookiecharset commands work for fields in multipart/form-data format except file upload fields. We cannot know what character set the user stores his files in, so we don't know how to translate an uploaded file to UTF-8 (assuming the uploaded file is even a text file). So the temporary files created by ns_getform contain the exact bytes sent by the client.
If you hand non-UTF-8 data to the Oracle client library when it thinks you are handing it UTF-8 data, it may crash. So when you are inserting an uploaded file into a CLOB, it is imperative that you run the file contents through Tcl's encoder first. I have not figured out a satisfactory way to automate this yet.
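A minimal sketch of that manual step, assuming (my assumptions, not the document's) that the upload field is named "upload", that ns_getform exposes its temporary path under the usual upload.tmpfile key, and that the file is known to be ISO-8859-1 text:

set form [ns_getform]
# hypothetical field name; ns_getform stores the uploaded file's temp path here
set tmpfile [ns_set get $form upload.tmpfile]

# read the raw bytes through Tcl's encoder so the result is valid internal UTF-8
set fd [open $tmpfile r]
fconfigure $fd -encoding iso8859-1
set clob_text [read $fd]
close $fd
# $clob_text is now safe to hand to a UTF-8 database client library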
Cookies
The browser should not mess with cookie values; it should just send back exactly the bytes you sent it. However, it is common to URL-encode cookie values that might otherwise have unsafe characters in them. You need to be careful to use the same character encoding for encoding and decoding cookie values.
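For instance, a small sketch (mine, not the document's) that pins the charset explicitly on both sides instead of relying on the connection or config default; $user_name and $raw_cookie_value are hypothetical:

# when setting the cookie, URL-encode the value with an explicit charset
set value [ns_urlencode -charset utf-8 $user_name]
ns_set put [ns_conn outputheaders] Set-Cookie "name=$value"

# when reading it back later, decode with the same charset
set name [ns_urldecode -charset utf-8 $raw_cookie_value]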
ns_httpopen / ns_httpget
The ns_httpopen command now parses the Content-Type header from the remote server and sets the encoding on the read file descriptor appropriately. If the content from the remote server is a text type but no charset was specified, then ns_httpopen uses the config parameter ns/parameters/HttpOpenCharset, which specifies the charset to assume the remote server is sending (iso-8859-1 by default).
References
- How to Use Tcl 8.1 Internationalization Features
- RFC 2616: Hypertext Transfer Protocol -- HTTP/1.1
- RFC 2070: Internationalization of the Hypertext Markup Language
- IANA Registered Charsets
- RFC 2396: Uniform Resource Identifiers (URI): Generic Syntax
- RFC 1345: Character Mnemonics & Character Sets
- RFC 2130: The Report of the IAB Character Set Workshop
- CJKV Information Processing