Forum OpenACS Improvement Proposals (TIPs): Tip#93: Add optional support to automatically remove smart quotes from textareas
What is being proposed?
- Add new procedure to utilities-procs.tcl
- Add parameters to acs-mail-lite and acs-templating to enable/disable this feature when sending email and when submitting/retrieving richtext and textarea values, respectively
This solves a common problem with text formatting that always gets noticed by users when content is pasted into a text area from MS Word.
The submitted data gets modified when it is inserted into the database.
It isn't just smart quotes that need to be handled. I use the following regular expressions in a Perl script to replace other smart stuff.
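# Replace MS stuff (the octal escapes are the UTF-8 bytes of each character).
$content =~ s/\342\200\230/'/g;    # U+2018 left single quote
$content =~ s/\342\200\231/'/g;    # U+2019 right single quote
$content =~ s/\342\200\246/.../g;  # U+2026 horizontal ellipsis
$content =~ s/\342\200\223/-/g;    # U+2013 en dash
$content =~ s/\240/|/g;            # single byte 0xA0 (no-break space in Latin-1/Windows-1252)
$content =~ s/\342\200\234/"/g;    # U+201C left double quote
$content =~ s/\342\200\235/"/g;    # U+201D right double quote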
So dashes, dots, and pipes are also affected.
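For comparison, here is a minimal sketch of the same replacements done on decoded characters rather than on raw bytes. It assumes the content has already been decoded from UTF-8 into a Perl character string. The sub name replace_smart_chars and the %ascii_for table are made up for illustration and are not part of OpenACS or the proposal. The no-break space mapping simply mirrors the script above.

```perl
use strict;
use warnings;

# Hypothetical lookup table: one entry per character handled by the
# byte-level script above.
my %ascii_for = (
    "\x{2018}" => "'",    # left single quote
    "\x{2019}" => "'",    # right single quote
    "\x{201C}" => '"',    # left double quote
    "\x{201D}" => '"',    # right double quote
    "\x{2026}" => '...',  # horizontal ellipsis
    "\x{2013}" => '-',    # en dash
    "\x{00A0}" => '|',    # no-break space, mapped to a pipe as in the script above
);

# Replace each listed character in an already-decoded string.
sub replace_smart_chars {
    my ($text) = @_;
    $text =~ s/([\x{2018}\x{2019}\x{201C}\x{201D}\x{2026}\x{2013}\x{00A0}])/$ascii_for{$1}/g;
    return $text;
}

my $pasted = "\x{201C}It\x{2019}s fine\x{201D} \x{2013} ok\x{2026}";
print replace_smart_chars($pasted), "\n";
```

Matching whole characters means a 0xA0 byte that is part of some other multi-byte UTF-8 sequence is never touched, which the byte-level version cannot guarantee.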
E.g. this ASCII Demoroniser, which "corrects moronic Microsoft HTML": http://www.fourmilab.ch/webtools/demoroniser/
has a Unicode variant:
If we set this up correctly, we can easily create things like valid XML feeds for content without worrying about what users type into text fields. At present I get feed validation errors for any non-typical input, e.g. special characters in German or Icelandic. We would also worry less about XHTML validation errors caused by users as the toolkit's markup gets cleaner over time.
Thanks for pointing that out. Yes, we do want to store the data in Unicode format, and it should already be formatted that way; the problem is that Word uses illegal characters. I'll take a look at unmoroniser and make sure we are converting the data properly. Basically we are just converting to regular quotes etc., which are supported in all encodings.
There isn't much hope of converting user input to Unicode; the forms should only accept Unicode data in the first place. (BTW, the default OpenACS config.tcl should have been configured to support Unicode input since at least 5.1, possibly earlier. You might want to check yours.)
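To illustrate what "only accept Unicode data" could look like in practice, here is a rough sketch that checks whether submitted bytes are well-formed UTF-8 before anything else is done with them. It is written in Perl to match the script earlier in the thread, not in the Tcl that OpenACS itself uses, and the helper name decode_utf8_or_undef is hypothetical. It relies only on the core Encode module.

```perl
use strict;
use warnings;
use Encode ();

# Return the decoded character string, or undef when the bytes are not
# well-formed UTF-8 (for example a stray Windows-1252 quote byte).
sub decode_utf8_or_undef {
    my ($bytes) = @_;
    my $text = eval {
        # FB_CROAK makes decode() die on malformed input;
        # LEAVE_SRC keeps the caller's buffer unmodified.
        Encode::decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return $text;
}

my $ok  = decode_utf8_or_undef("it\342\200\231s");  # UTF-8 bytes for U+2019
my $bad = decode_utf8_or_undef("it\222s");          # lone byte 0x92 (Windows-1252 quote)
print defined $ok  ? "accepted\n" : "rejected\n";
print defined $bad ? "accepted\n" : "rejected\n";
```

Input that fails this check was not sent as UTF-8 at all, so replacing smart quotes would not help; it has to be rejected or transcoded from its real encoding first.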