Forum OpenACS Improvement Proposals (TIPs): Tip#93: Add optional support to automatically remove smart quotes from textareas

A common and documented problem with text input is cruft from Microsoft smart quotes and other Word-specific characters.

What is being proposed?

- Add new procedure to utilities-procs.tcl
- Add parameters to acs-mail-lite and acs-templating to enable/disable this feature when sending email and when submitting/retrieving richtext and textarea values, respectively
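
To make the proposal concrete, here is a rough sketch of a per-package switch guarding the cleanup. It is written in Python for illustration only, not as the Tcl procedure the TIP would add. The parameter name `RemoveSmartQuotesP`, the `PARAMETERS` table, and the function name are all hypothetical; the TIP does not name them.

```python
# Hypothetical per-package switches; the real parameter names are not given in the TIP.
PARAMETERS = {
    "acs-mail-lite": {"RemoveSmartQuotesP": True},
    "acs-templating": {"RemoveSmartQuotesP": False},
}

# Curly quotes mapped to their plain ASCII counterparts.
QUOTES = str.maketrans({"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'})


def maybe_remove_smart_quotes(package: str, text: str) -> str:
    """Clean the text only when the package's switch is on."""
    enabled = PARAMETERS.get(package, {}).get("RemoveSmartQuotesP", False)
    return text.translate(QUOTES) if enabled else text
```

With the switch off, or for a package that has no entry, the text passes through unchanged, so the feature stays optional as the TIP title says.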

Advantages

This solves a common problem with text formatting that always gets noticed by users when content is pasted into a text area from MS Word.

Disadvantages

The submitted data gets modified when it is inserted into the database.

This is a much-needed feature, especially since I find users draft stuff up in Word, then copy/paste it into a blog, for example.

It isn't just smart quotes that need to be handled. I use the following regular expressions in a Perl script to replace other smart stuff.

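# Replace MS stuff (the octal escapes are the UTF-8 bytes of each character).
$content =~ s/\342\200\230/'/g;    # U+2018 left single quote
$content =~ s/\342\200\231/'/g;    # U+2019 right single quote
$content =~ s/\342\200\246/.../g;  # U+2026 ellipsis
$content =~ s/\342\200\223/-/g;    # U+2013 en dash
$content =~ s/\240/|/g;            # byte 0xA0 (no-break space in Latin-1)
$content =~ s/\342\200\234/"/g;    # U+201C left double quote
$content =~ s/\342\200\235/"/g;    # U+201D right double quote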

So dashes, dots, and pipes are also affected.
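
For comparison, the same substitutions can be sketched on already-decoded Unicode text with a translation table. This is a minimal Python illustration, not the procedure proposed in the TIP; the names `SMART_TO_ASCII` and `strip_smart_chars` are made up here. The `\240` rule is left out because it matches a raw byte, and a decoded string has no raw bytes to match.

```python
# Hypothetical mapping; each entry mirrors one of the Perl substitutions above.
SMART_TO_ASCII = {
    "\u2018": "'",    # left single quote
    "\u2019": "'",    # right single quote
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
    "\u2026": "...",  # ellipsis
    "\u2013": "-",    # en dash
}

# str.maketrans accepts single-character keys and replacement strings of any length.
_TABLE = str.maketrans(SMART_TO_ASCII)


def strip_smart_chars(text: str) -> str:
    """Return text with the listed characters replaced by ASCII equivalents."""
    return text.translate(_TABLE)
```

Working on characters rather than bytes avoids accidentally matching part of a multi-byte sequence that belongs to some other character.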

I don't think it is a drawback if the submitted data gets modified for display on the web. After all, if we don't modify it, the text formatting is strange.

Approved.

In general we should move towards storing user input as Unicode in the database and then depend on the display logic to take care of presenting pages with browser/html friendly text as needed (which should be the default).

E.g. this ASCII Demoroniser, which "corrects moronic Microsoft HTML": http://www.fourmilab.ch/webtools/demoroniser/

has a Unicode variant:
http://rheme.net/unmoroniser/

If we set this up correctly, we can then easily create things like valid XML feeds for content without worrying about what users type in text fields. (Presently I get feed validation errors for any non-typical input, e.g. special characters in German or Icelandic.) We could also worry less about XHTML validation errors caused by users as the cleanliness of the toolkit's markup increases with time.
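
One way to picture "store Unicode, fix it up at display time" is the following Python sketch, which assumes the stored value is already a proper Unicode string. The function name `feed_text` is hypothetical. It escapes XML markup characters and emits everything outside ASCII as numeric character references, so the output no longer depends on the feed's declared encoding.

```python
from xml.sax.saxutils import escape


def feed_text(stored: str) -> str:
    """Make stored Unicode text safe to embed in an XML feed.

    Markup characters (&, <, >) are escaped first; then every non-ASCII
    character is replaced by a numeric character reference.
    """
    return escape(stored).encode("ascii", "xmlcharrefreplace").decode("ascii")
```

This sketch does not handle control characters that XML forbids outright; those would need separate filtering.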

Carl,

Thanks for pointing that out. Yes, we do want to store the data in Unicode format. It should already be formatted that way; the fact that Word uses illegal characters is the problem. I'll take a look at unmoroniser and make sure we are converting the data properly. Basically we are just converting to regular quotes etc., which are supported in all encodings.

There isn't much hope of converting user input to Unicode; the forms should only accept Unicode data in the first place. (BTW, the default OpenACS config.tcl should be configured to support Unicode input at least since 5.1, possibly earlier; you might want to check yours.)
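
As a rough illustration of "only accept Unicode in the first place", a strict check on the raw submitted bytes might look like the Python sketch below. It assumes the form is expected to deliver UTF-8, which is an assumption of this example and not something OpenACS itself is claimed to do this way; `is_valid_utf8` is a hypothetical name.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the submitted bytes decode cleanly as UTF-8."""
    try:
        raw.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True
```

Bytes such as 0x93 and 0x94 (the curly double quotes in Windows-1252) fail this check when they appear on their own, which is one way such input shows up as "illegal" on the server side.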

Not sure if I got the rules right, but two yeses and no nos make this approved? Interesting that nearly no OCT members bother to comment on TIPs nowadays.