Forum OpenACS Development: New Feature: Formbuilder maxlength

Posted by Lars Pind on
I've added a new feature to form builder elements: "maxlength".

Use like this:

element create myform term_name -label "Term name" -datatype text -maxlength 20

What it'll do is add a maxlength="20" attribute to the input widget.

More importantly, it also validates the value on the server side, with a call to [string bytelength], which correctly handles multibyte characters.

The error message to the user will be "Term name is 3 characters too long". The reason we don't state the limit explicitly is that the effective character limit depends on whether multibyte characters are present. Telling the user to remove 3 characters will always work: if some of the removed characters are multibyte, he could have gotten away with removing fewer, but removing 3 is guaranteed to be safe.
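The arithmetic behind that message can be sketched in shell (a hypothetical example value; UTF-8 input assumed, with `wc -c` counting bytes the way `string bytelength` does):

```shell
# "blåbærgrød" is 10 characters but 13 bytes in UTF-8,
# because å, æ and ø each take two bytes.
s="blåbærgrød"
maxlength=10
bytes=$(printf '%s' "$s" | wc -c)   # byte count, locale-independent
excess=$(( bytes - maxlength ))
echo "Term name is $excess characters too long"
```

Removing any 3 characters frees at least 3 bytes, which is why the count is always safe advice.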

I've also added it to ad_form:

{term_name:text {label "Term name"} {maxlength 20}}

Please use liberally on all your forms, so we can avoid those nasty DB errors causing 500 internal server errors, just because the user typed a few characters too much.

/Lars

Posted by Jeff Davis on
I am not sure bytelength is the right thing to use. I guess in general it will be conservative, but if your db is utf-8 and you have a varchar(20), isn't it 20 characters even if they are multibyte? Conversely, if your db is iso-8859-1 and you enter high-bit characters, bytelength will say 2 bytes but the representation in the DB will be one byte.
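Jeff's two cases can be illustrated in shell (sample string is hypothetical; assumes a UTF-8 source encoding and that `iconv` is available): a UTF-8 varchar(20) counts characters, while bytelength counts UTF-8 bytes, and an ISO-8859-1 database stores each of these accented letters as a single byte.

```shell
s="blåbærgrød"   # 10 characters
printf '%s' "$s" | wc -c                                   # 13 bytes as UTF-8
printf '%s' "$s" | iconv -f UTF-8 -t ISO-8859-1 | wc -c    # 10 bytes as ISO-8859-1
```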
Posted by Lars Pind on
Hm. I did this with my PG installation, and bytelength was what corresponded to the PG interpretation of my varchar(20).

Don't know about the character set of my PG installation. How do I find out?

What would the right way to check for maxlength before the page blows up be?

/Lars

Posted by Michael Hinds on
Lars,

I'm not sure why you don't want to use string length. Here's what the manual says about bytelength:

string bytelength string — Returns a decimal string giving the number of bytes used to represent string in memory. Because UTF-8 uses one to three bytes to represent Unicode characters, the byte length will not be the same as the character length in general. The cases where a script cares about the byte length are rare. In almost all cases, you should use the string length operation. Refer to the Tcl_NumUtfChars manual entry for more details on the UTF-8 representation.

So it seems to me string length works fine. Have you seen evidence otherwise?

Posted by Tilmann Singer on
Type psql -l to find out the encoding of your pg databases:
tils@tp:~$ psql -l
        List of databases
   Name    |  Owner   | Encoding
-----------+----------+----------
 beta      | tils     | UNICODE
 lari      | tils     | UNICODE
 lari2     | tils     | UNICODE
...
If you have something else in there, for example SQL_ASCII, then those are single-byte encoded databases. As far as I understand, it's almost always the right thing to create your database as UNICODE when you want to be able to store data in different encodings.

The error that your maxlength procedure catches indicates that something else went wrong earlier, because in that case you would end up storing a single international character (e.g. a German umlaut) as two characters in the db, which leads to lots of other problems. For example, a query that selects a substring could split the 2-byte character into two pieces. You should have created your database UNICODE encoded, or in an encoding that understands the characters that you need (e.g. LATIN1).
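The substring hazard is visible at the byte level (UTF-8 assumed): the Danish å is the two-byte sequence C3 A5, so any byte-oriented substring that keeps only the first byte leaves an invalid fragment.

```shell
printf 'å' | wc -c                    # 2 bytes in UTF-8
printf 'å' | head -c 1 | od -An -tx1  # c3 alone: not a valid UTF-8 character
```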

Posted by Lars Pind on
Yes, I had the problem that the Danish letters æ, ø, and å took up two bytes each in the DB row, and this fixed it.

Checking psql -l, indeed all my databases are in SQL_ASCII. How do I fix that now?

Switching from bytelength to length is trivial, thankfully.

/Lars

Posted by Lars Pind on
Looks to me like our documentation is wrong.

https://openacs.org/doc/openacs-4/openacs.html

It doesn't say anything about setting UNICODE encoding, AFAICT.

I don't even see anything in Joel's new doc.

http://aufrecht.org/doc/unix-install.html

/Lars

Posted by Tilmann Singer on
As far as I know there is no way to change the encoding of an existing database in PostgreSQL, apart from pg_dump'ing the contents, recreating the database in the desired new encoding, and importing the data. I've never done that myself, so I don't know whether it's necessary to specify encodings for pg_dump or psql when importing. You probably need to find a trick to tell pg to export the chars that were wrongly saved as 2 characters as one, or run a regexp over the dump file.
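The dump-and-recreate cycle described above might look like this (the database name mydb is hypothetical; take a verified backup first, and note that wrongly double-encoded characters in the dump may still need fixing by hand or with a regexp before reimport). No test is attached since this needs a running PostgreSQL server:

```shell
pg_dump mydb > mydb.dump     # export the existing data
dropdb mydb                  # destructive: make sure the dump is good first
createdb -E UNICODE mydb     # recreate with the desired encoding
psql mydb < mydb.dump        # reimport the data
```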

Regarding the missing documentation, I added a comment to the installation page.