Forum OpenACS Q&A: Problems with Microsoft char set (eg 'smart quotes') in form input

When users submit posts (to bboard/comments etc) using Internet Explorer, so-called 'smart quotes' and other chars like em-dashes show up as garbage when displayed subsequently.

This is clearly an issue of setting up encoding correctly, but before I do anything drastic to several running systems, I'd appreciate a reality check on what needs to be done.

There are several potential knobs to twist:

  • The database encoding (PG 7.1.3): the dbs are currently SQL-ASCII (since that was the default long ago when I created them). I can reset them to UNICODE -- but are there any bad side effects of doing that on a running db? Do you need to create a new db and import the data from the old one instead?

    Furthermore, would setting the PG encoding to UNICODE take care of this problem, since it involves Microsoft's char set (cp-1252) not UNICODE? It appears here http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT that there are mappings from cp-1252 to unicode; are these 'automated' (whatever that might mean) in PG, AOLServer and Tcl?

  • Running all user input through a Tcl proc to regsub the crap out. This seems like a brute force approach unworthy of an enlightened OpenACSer, but it could work.

  • Messing with AOLserver's mappings. In my install, there are no entries for charsets, though I see in this thread: https://openacs.org/bboard/q-and-a-fetch-msg.tcl?msg_id=0003jQ&topic_id=11&topic=OpenACS these params suggested:
    ns_section "ns/encodings"
           ns_param adp iso8859-1
           ns_param tcl iso8859-1
    
    and
    ns_section "ns/parameters"
           ns_param        URLCharset      "utf-8"
           ns_param        OutputCharset   "utf-8"
           ns_param        HttpOpenCharset "utf-8"
    

    Are entries such as these needed for AOLServer (I'm using 3.3ad13) to support UNICODE correctly?

  • Tcl version (I've seen some comments about 7.x to 8.x changes in threads here). But I'm using 8+ so that presumably isn't an issue.
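On the question of whether the cp-1252 mappings are available in Tcl: a minimal sketch, assuming a Tcl 8.1+ interpreter whose bundled encodings include cp1252 (check with `encoding names`). The variable names are illustrative only, and whether AOLserver hands your code raw bytes or already-decoded text depends on its charset configuration.

```tcl
# Raw bytes as a Windows browser might submit them: cp1252 "smart" double quotes.
set raw "\x93hello\x94"

# Decode the bytes as cp1252; Tcl's table maps 0x93/0x94 to U+201C/U+201D.
set text [encoding convertfrom cp1252 $raw]

# Optionally flatten the typographic quotes to plain ASCII.
set ascii [string map [list \u201c \" \u201d \"] $text]
puts $ascii
```

If the text is decoded this way before it reaches the database, the stored values are real Unicode characters rather than vendor-specific bytes.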

The simple thing seems to just reset the db's encoding from SQL-ASCII to UNICODE and hope for the best. Is this insane?

This appears not to be a problem that others are having (or at least having to ask about). Are people's users not using IE, or am I just victimized by using an older OpenACS install (still using 3.2.5 since it Does Its Job)? 😉

Many thanks for any pointers!

John Walker (of AutoCAD fame) wrote a Perl script to regsub things like smart quotes. His script has since been modified by a few others (search for demoroniser on Google). His original script, along with a good summary of what the problem is, can be found at http://www.fourmilab.ch/webtools/demoroniser/. If OACS doesn't already have something to do this, it might be worth adapting this code.

I also have been unable to find a satisfactory solution, and the brute force regexp method is a hack. Are there better ways to do this?

Thanks for the excellent pointers, Michael. Glad to know that this isn't merely a unicode issue. Though that makes a clean solution harder. I'm going to give Demoronizer a try.

On the same theme of Microsoft's perpetual "Your feet are the wrong size for your shoes so we're going to force you to buy ours" business model, note the recent I, Cringely analysis of Palladium here: http://www.pbs.org/cringely/pulpit/pulpit20020627.html

Stan,  [off topic reply]

Thankfully, OpenACS works well as an intranet solution (over local tcp/ip).

There's always fidonet.org to keep us connected when the internet starts requiring hourly resets, email message delivery becomes unreliable, etc.  I'm sure an OpenACS api could be developed pronto if and when Palladium becomes reality =)

Torben

Here's the code we use.

http://my.brandeis.edu/api-doc/proc-view?proc=br_demoronise&source_p=1

We filter posts to bboard, news, and calendar.

Rich, thanks for pointing to your code. I like your use of the string map Tcl proc rather than regsub (more elegant and besides Brent Welch uses it in a similar example for getting rid of "smart quotes" in his Tcl book).

We've found it useful to include a few more mappings for Microsoft's "smart fractions" etc. Here's our version of this "decrufting" proc:

proc_doc decruft { cruft } { 
Takes a string and removes all the cruft introduced by Microsoft apps,
such as their 'smart quotes'. Brute-force approach suggested by John
Walker's Demoronizer, a Perl script which does a few other things that
aren't germane here.

This proc could get called lots of places, but to make it
automatically run against all user input, we call it from
ad_page_variables and (for backward compatibility since this 
still lurks in the code) set_the_usual_form_variables. It 
should be trivial to add it to page_contract or whatever OACS 4.5+ uses.
} {    
#    ns_log Notice "Before De-Cruft: $cruft"

    # Map the Microsoft-only characters (0x82-0x9f) and the smart
    # fractions (0xbc-0xbe) to plain ASCII stand-ins.
    set cruft [string map [list \
        \x82 , \x83 f \x84 ,, \x85 ... \x86 t \x87 I \x88 ^ \x89 { */**} \
        \x8a S \x8b < \x8c Oe \x8d {} \x8e Z \x8f {} \x90 {} \x91 ` \
        \x92 ' \x93 {"} \x94 {"} \x95 * \x96 - \x97 -- \x98 ~ \x99 tm \
        \x9a S \x9b > \x9c oe \x9d {} \x9e Z \x9f Y \
        \xbd 1/2 \xbc 1/4 \xbe 3/4] $cruft]

#    ns_log Notice "After De-Cruft: $cruft"

    return $cruft
}
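For anyone who wants to try the string map idea outside an ACS install, here is a rough standalone sketch. It assumes a plain tclsh, uses an ordinary proc instead of proc_doc, and covers only a handful of the mappings; the name decruft_sketch is made up for illustration. The keys are the 0x80-0x9f code points that show up when cp-1252 bytes have been read as ISO-8859-1.

```tcl
# Hypothetical standalone variant: plain proc, reduced mapping table.
proc decruft_sketch {text} {
    return [string map [list \
        \x91 ' \x92 ' \
        \x93 \" \x94 \" \
        \x96 - \x97 -- \
        \x85 ...] $text]
}

puts [decruft_sketch "\x93It\x92s here\x94 \x97 done\x85"]
```

string map makes a single pass over the input, so the replacements never get re-scanned, which is one reason it is tidier than a chain of regsub calls.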

In addition, instead of calling this proc within modules like bboard and news, we find it useful to push the call back into ad_page_variables (and set_the_usual_form_variables since that still gets called some places). That way it always gets called regardless of the destination of the form data. FWIW, here's how we do it:

proc_doc ad_page_variables {variable_specs} {

Current syntax:

ad_page_variables {var_spec1 [varspec2] ... }

    This proc handles translating form inputs into Tcl variables, and checking to see that the correct set of inputs was supplied.  Note that this is mostly a check on the proper programming of a set of pages.

Here are the recognized var_specs:

variable; means it's required
{variable default-value}
      Optional, with default value.  If the value is supplied but is null, and the
      default-value is present, that value is used.
{variable -multiple-list}
      The value of the Tcl variable will be a list containing all of the values (in order) supplied for that form variable.  Particularly useful for collecting checkboxes or select multiples.
      Note that if required or optional variables are specified more than once, the first (leftmost) value is used, and the rest are ignored.

{variable -array}
      This syntax supports the idiom of supplying multiple form variables of the
      same name but ending with a "_[0-9]", e.g., foo_1, foo_2.... Each value will be
      stored in the array variable variable with the index being whatever follows the
      underscore.

There is an optional third element in the var_spec.  If it is "QQ", "qq", or some variant, a variable named "QQvariable" will be created and given the same value, but with single quotes escaped suitable for handing to SQL.

Other elements of the var_spec are ignored, so a documentation string
describing the variable can be supplied.

Note that the default value form will become the value form in a "set"

Note that the default values are filled in from left to right, and can depend on values of variables to their left:
ad_page_variables {
    file
    {start 0}
    {end {[expr $start + 20]}}
}

} {
#   ns_log Notice "ad_page_variables"
    set exception_list [list]
    set form [ns_getform]
    if { $form != "" } {
        set form_size [ns_set size $form]
        set form_counter_i 0
        
        # first pass -- go through all the variables supplied in the form
        while {$form_counter_i<$form_size} {
            set variable [ns_set key $form $form_counter_i]
            set value [ns_set value $form $form_counter_i]
            check_for_form_variable_naughtiness $variable $value
            set found "not"
            # find the matching variable spec, if any
            foreach variable_spec $variable_specs {
                if { [llength $variable_spec] >= 2 } {
                    switch -- [lindex $variable_spec 1] {
                        -multiple-list {
                            if { [lindex $variable_spec 0] == $variable } {
                                # variable gets a list of all the values
                                upvar 1 $variable var
                                lappend var $value
                                set found "done"
                                break
                            }
                        }
                        -array {
                            set varname [lindex $variable_spec 0]
                            set pattern "($varname)_(.+)"
                            if { [regexp $pattern $variable match array index] } {
                                if { ![empty_string_p $array] } {
                                    upvar 1 $array arr
                                    set arr($index) [ns_set value $form $form_counter_i]
                                }
                                set found "done"
                                break
                            }
                        }
                        default {
                            if { [lindex $variable_spec 0] == $variable } {
                                set found "set"
                                break
                            }
                        }
                    }
                } elseif { $variable_spec == $variable } {
                    set found "set"
                    break
                }
            }
            if { $found == "set" } {
                upvar 1 $variable var
                if { ![info exists var] } {
                    # take the leftmost value, if there are multiple ones
                    set var [ns_set value $form $form_counter_i]
                }
            }
            incr form_counter_i
        }
    }
    
    # now make a pass over each variable spec, making sure everything required is there
    # and doing defaulting for unsupplied things that aren't required
    foreach variable_spec $variable_specs {
        set variable [lindex $variable_spec 0]
        upvar 1 $variable var
        
        if { [llength $variable_spec] >= 2 } {
            if { ![info exists var] } {
                set default_value_or_flag [lindex $variable_spec 1]
                
                switch -- $default_value_or_flag {
                    -array {
                        # don't set anything
                    }
                    -multiple-list {
                        set var [list]
                    }
                    default {
                        # Needs to be set.
                        uplevel [list eval set $variable "[subst [list $default_value_or_flag]]"]
                        # This used to be:
                        #
                        #   uplevel [list eval [list set $variable "$default_value_or_flag"]]
                        #
                        # But it wasn't properly performing substitutions.
                    }
                }
            }
            
            
        } else {
            if { ![info exists var] } {
                lappend exception_list "\"$variable\" required but not supplied. Bummer."
            }
        }
        # modified by rhs@mit.edu on 1/31/2000
        # to QQ everything by default (but not arrays)
        if {[info exists var] && ![array exists var]} {
            # Begin De-Cruft stuff here
#           ns_log Notice "Before De-Cruft: $var"
            set var [decruft $var]
#           ns_log Notice "After De-Cruft: $var"
            # End De-Cruft stuff here
            upvar QQ$variable QQvar
            set QQvar [DoubleApos $var]
        }
        
    }
    
    set n_exceptions [llength $exception_list]
    # this is an error in the HTML form
    if { $n_exceptions == 1 } {
        ns_returnerror 500 [lindex $exception_list 0]
        return -code return
    } elseif { $n_exceptions > 1 } {
        ns_returnerror 500 "<li>[join $exception_list "\n<li>"]\n"
        return -code return
    }
}

For amusement value, here's a demo we created that shows the problem and the fix: http://www.epimetrics.com/demos/decrufter?demo_id=7

Sorry to dredge up this old topic, but I can't seem to get either the demoroniser code or Stan's decruft code to replace the MS characters.  I typed "hello" in MS Word and copy/pasted it into a form I created.  I logged the before and after the "conversion" values of the string and they showed:

Before: \x93hello\x94
After: \xef\xbe\x93hello\xef\xbe\x94
(\x93 and \x94 are the MS smart quotes)

It looks like the code to replace the MS characters is adding more junk instead of removing them.  I can't figure out what is wrong.  I even stripped down the proc to search only for \x93 and it still adds the garbage.

set some_string [string map {\x93 {"}} $some_string]
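When before/after log lines look confusing like this, it can help to dump the actual code points Tcl sees rather than the raw log bytes. A small diagnostic sketch follows, assuming Tcl 8.4 or later; the proc name codepoints is hypothetical, not part of AOLserver or OpenACS.

```tcl
# Return a list such as {U+0093 U+0068 ...} describing each character in s.
proc codepoints {s} {
    set out [list]
    foreach ch [split $s ""] {
        lappend out [format "U+%04X" [scan $ch %c]]
    }
    return $out
}

# Smart quotes that were read as ISO-8859-1 show up in the U+0080-U+009F range...
puts [codepoints "\x93hi\x94"]
# ...while properly decoded ones show up as U+201C / U+201D.
puts [codepoints "\u201chi\u201d"]
```

Which range the characters land in tells you which keys a string map or regsub has to look for.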

I'm using AOLServer 4.0 with TCL 8.4.

I would appreciate any help or insights.  Thanks!

Gilbert

After configuring our application to correctly demoronise strings over a year ago, the problem resurfaced recently. I was unable to find out which specific change in our environment forced us to revisit this mapping problem (we had changed everything: server, operating system, version of AOLserver, Tcl, our custom code base, etc., making it very difficult to track down a single cause).

In any case, here are a few relevant threads:

http://rhea.redhat.com/bboard-archive/webdb/000eUv.html

http://www.mail-archive.com/aolserver@listserv.aol.com/msg06119.html

In the end, we needed to set the Oracle NLS_LANG environment variable and to use regsub instead of string map.

-Mike

Thanks Mike.  I changed the character encoding in AOLServer to iso8859-1 and that seems to take care of the problem.  I don't need to replace the characters.  The browsers I tested seem to be able to correctly display the MS characters.

Sorry for bumping, but this may be of help to someone else. I had a problem with the apostrophe pasted from MS Word and converted it as follows:
set microsoft_apostrophe \u2019
regsub -all $microsoft_apostrophe $letter "'" letter

Ideally this should be replaced with the HTML entity &#8217;, but that doesn't play well with HTMLarea, so I just used the standard apostrophe instead. Here's the code to convert to the HTML entity:
# braces keep the backslash, so regsub emits a literal "&" rather than the matched text
set apostrophe {\&#8217;}
set microsoft_apostrophe \u2019
regsub -all $microsoft_apostrophe $letter $apostrophe letter

Brian
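If more than the apostrophe needs handling, the same idea extends to the other Unicode punctuation Word tends to produce. A rough sketch, assuming the input has already been decoded to Unicode (so the characters arrive as U+2018 and friends rather than cp-1252 bytes); the proc name plain_punctuation is made up for illustration.

```tcl
# Hypothetical helper: flatten common typographic punctuation to ASCII.
proc plain_punctuation {text} {
    return [string map [list \
        \u2018 ' \u2019 ' \
        \u201c \" \u201d \" \
        \u2013 - \u2014 -- \
        \u2026 ...] $text]
}

puts [plain_punctuation "It\u2019s \u201cok\u201d"]
```

Using string map here avoids one regsub call per character and sidesteps the special meaning of & and backslashes in regsub replacement strings.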