Forum OpenACS Development: Catching HTML comments in html to text conversion

The current Tcl procedure ad_html_to_text does not take into account HTML comments.

This is fine, except that MS Outlook and Word insert some extremely ugly comments that make it through the text conversion.

For example:

<!--[if !mso]> v\:* {behavior:url(#default#VML);} o\:*
        {behavior:url(#default#VML);} w\:* {behavior:url(#default#VML);}
        .shape {behavior:url(#default#VML);} <![endif]--> <!-- /* Font
        Definitions */ @font-face {font-family:Wingdings; panose-1:5 0 0 0 0 0
            0 0 0 0;} @font-face {font-family:Tahoma; panose-1:2 11 6 4 3 5 4 4 2
                4;} /* Style Definitions */ p.MsoNormal, li.MsoNormal, div.MsoNormal
        {margin:0in; margin-bottom:.0001pt; font-size:12.0pt;
            font-family:"Times New Roman";} a:link, span.MsoHyperlink {color:blue;
                text-decoration:underline;} a:visited, span.MsoHyperlinkFollowed
        {color:blue; text-decoration:underline;} p {mso-margin-top-alt:auto;
            margin-right:0in; mso-margin-bottom-alt:auto; margin-left:0in;
            font-size:12.0pt; font-family:"Times New Roman";} span.EmailStyle18
        {mso-style-type:personal-reply; font-family:Arial; color:navy;} @page
        Section1 {size:8.5in 11.0in; margin:1.0in 1.25in 1.0in 1.25in;}
        div.Section1 {page:Section1;} /* List Definitions */ @list l0
        {mso-list-id:669450480; mso-list-template-ids:145939189
            6;} @list
        l0:level1 {mso-level-number-format:bullet; mso-level-text:\F0B7;
            mso-level-tab-stop:.5in; mso-level-number-position:left;
            text-indent:-.25in; mso-ansi-font-size:10.0pt; font-family:Symbol;}
        @list l1 {mso-list-id:1015379521; mso-list-template-ids:-1243462522;}
        ol {margin-bottom:0in;} ul {margin-bottom:0in;} -->

I noticed this in incoming email where Outlook uses Word for composition.

I added some code to ad_html_to_text but I am not sure how to test it to make sure it doesn't do anything dumb.
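
One way to test this kind of change is a small table of inputs and expected outputs. The following is only a sketch: strip_html_comments and check are hypothetical names, and the regsub one-liner is a stand-in for illustration, not the code in the patch. Inside an OpenACS install the same cases could presumably be pointed at ad_html_to_text itself.

```tcl
# Hypothetical stand-in for the comment handling; NOT the code from the patch.
# The non-greedy match makes each comment end at the first "-->".
proc strip_html_comments {html} {
    regsub -all {<!--.*?-->} $html {} html
    return $html
}

# Hypothetical helper: raise an error if the result is not what we expect.
proc check {input expected} {
    set got [strip_html_comments $input]
    if {$got ne $expected} {
        error "FAIL: [list $input] gave [list $got], expected [list $expected]"
    }
}

check "plain text"                                 "plain text"
check {a<!-- x -->b}                               ab
check {a<!-- one -->b<!-- two -->c}                abc
check {a<!--[if !mso]> v\:* {x:y;} <![endif]-->b}  ab
check "a<!-- spans\ntwo lines -->b"                ab
# A stray "-->" with no opener should be left alone.
check {a --> b}                                    {a --> b}
# This stand-in leaves an unterminated comment untouched; whether the
# real procedure should do the same is a separate decision.
check {a<!-- never closed}                         {a<!-- never closed}

puts "all cases passed"
```

The cases worth covering are the boring ones (no comments at all), several comments in one input, comments spanning lines, and the unterminated comment, since that is where a lookahead is most likely to do something dumb.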

I have uploaded a patch
https://openacs.org/bugtracker/openacs/patch?patch%5fnumber=854
and would appreciate some review.

Thanks.

Posted by Don Baccus on
I think it's OK. HTML comments can't nest, so your straightforward lookahead for "-->" should be fine.
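
Roughly, the lookahead amounts to something like the sketch below. This is an illustration only, not the code from the patch: the proc name skip_html_comments is made up, and dropping everything after an unterminated "<!--" is an assumption about the desired behaviour.

```tcl
# Hypothetical helper, NOT the code from the patch.
# Copies text up to each "<!--", then skips ahead to the next "-->".
proc skip_html_comments {html} {
    set out ""
    set pos 0
    while {[set start [string first "<!--" $html $pos]] >= 0} {
        # Keep everything before the comment opener.
        append out [string range $html $pos [expr {$start - 1}]]
        # Comments do not nest, so the first "-->" after the opener ends it.
        set end [string first "-->" $html [expr {$start + 4}]]
        if {$end < 0} {
            # Assumption: an unterminated comment swallows the rest of the input.
            return $out
        }
        set pos [expr {$end + 3}]
    }
    # No more openers: keep whatever text is left.
    append out [string range $html $pos end]
    return $out
}

puts [skip_html_comments {Hello <!-- /* Font Definitions */ -->world}]
```

Because nothing nests, there is no need to count openers and closers; a single forward scan per comment is enough.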

Since you're stripping out HTML comments, let me tease you by pointing out that the one error I found is in a Tcl comment...

"beleive noone"

oh, I'm bad - sorry, dave! :)

I am surprised no one noticed this before, but really, relatively few people use HTML comments, particularly in static HTML files.

And only MS is crazy enough to bury formatting info for other apps as comments in an HTML file. Well, I hope that's true. It should be true, if there's justice in the world and all that.